How to handle NILs with Anemone / Nokogiri web scraper? - ruby

def scrape!(url)
  Anemone.crawl(url) do |anemone|
    anemone.on_pages_like %r[/events/detail/.*] do |page|
      show = {
        headliner: page.doc.at_css('h1.summary').text,
        openers: page.doc.at_css('.details h2').text
      }
      puts show
    end
  end
end
I'm writing a scraper in Anemone, which uses Nokogiri under the hood.
Sometimes the selector '.details h2' returns nothing because the element isn't in the HTML, and calling text on the result throws an exception.
I'd like to avoid if/elses all over the place...
if page.doc.at_css('.details h2').empty?
openers: page.doc.at_css('.details h2').text
end
Is there a more elegant way of handling errors produced by inconsistent markup? For instance, CoffeeScript has the existential operator: person.name?.first(). If the HTML has the element, great: make the object and call text on it. If not, move on and don't add it to the hash.

You just need to do:
anemone.on_pages_like %r[/events/detail/.*] do |page|
  if not page.nil?
    ...#your code
  end
end
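For the missing-element case described in the question, Ruby 2.3+ has a safe navigation operator (&.) that plays the role of CoffeeScript's ?., and Hash#compact (Ruby 2.4+) drops keys whose value is nil. The following is only a sketch: FakeNode and FakeDoc are hypothetical stand-ins so that it runs without Anemone or Nokogiri, and they mimic the fact that at_css returns nil when nothing matches.

```ruby
# Hypothetical stand-ins so the sketch runs without Anemone or Nokogiri.
# FakeDoc#at_css mimics Nokogiri returning nil when a selector matches nothing.
FakeNode = Struct.new(:text)

class FakeDoc
  def initialize(nodes)
    @nodes = nodes
  end

  def at_css(selector)
    @nodes[selector]
  end
end

# Build the show hash, skipping any field whose element is missing.
def build_show(doc)
  {
    headliner: doc.at_css('h1.summary')&.text, # &. yields nil instead of raising
    openers: doc.at_css('.details h2')&.text
  }.compact # drop the keys whose value is nil
end

full = FakeDoc.new('h1.summary' => FakeNode.new('Band A'),
                   '.details h2' => FakeNode.new('Band B'))
partial = FakeDoc.new('h1.summary' => FakeNode.new('Band A'))

p build_show(full)
p build_show(partial)
```

In the scraper itself, build_show(page.doc) would take the place of the hash literal, assuming page.doc is the Nokogiri document as in the question.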

Related

Nokogiri - Checking if the value of an xpath exists and is blank or not in Ruby

I have an XML file, and before I process it I need to make sure that a certain element exists and is not blank.
Here is the code I have:
CSV.open("#{csv_dir}/products.csv","w",{:force_quotes => true}) do |out|
  out << headers
  Dir.glob("#{xml_dir}/*.xml").each do |xml_file|
    gdsn_doc = GDSNDoc.new(xml_file)
    logger.info("Processing xml file #{xml_file}")
    @desc_exists = @gdsn_doc.xpath("//productData/description")
    if !@desc_exists.empty?
      row = []
      headers.each do |col|
        row << product[col]
      end
      out << row
    end
  end
end
The following code is not working to find the "description" element and to check whether it is blank or not:
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
Here is a sample of the XML file:
<productData>
<description>Chocolate biscuits </description>
</productData>
This is how I have defined the class and Nokogiri:
class GDSNDoc
  def initialize(xml_file)
    @doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
    @doc.remove_namespaces!
The code had to be moved up to an earlier stage, where Nokogiri was initialised. It doesn't get runtime errors, but it does let XML files with blank descriptions get through and it shouldn't.
class GDSNDoc
  def initialize(xml_file)
    @doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
    @doc.remove_namespaces!
    desc_exists = @doc.xpath("//productData/descriptions")
    if !desc_exists.empty?
You are creating your instance like this:
gdsn_doc = GDSNDoc.new(xml_file)
then use it like this:
@desc_exists = @gdsn_doc.xpath("//productData/description")
@gdsn_doc and gdsn_doc are two different things in Ruby - try just using the version without the @:
@desc_exists = gdsn_doc.xpath("//productData/description")
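A minimal sketch of why the two names differ: a local variable and an instance variable with the same base name are unrelated, and an instance variable that was never assigned evaluates to nil. The Demo class and its lookup method are made up for illustration only.

```ruby
# Hypothetical class, only to show local vs. instance variables.
class Demo
  def lookup
    gdsn_doc = 'parsed document' # local variable, visible only inside this method
    [gdsn_doc, @gdsn_doc]        # @gdsn_doc was never assigned, so it is nil
  end
end

local_value, ivar_value = Demo.new.lookup
p local_value
p ivar_value

# Calling a method on the nil instance variable raises NoMethodError,
# which is the kind of error reported in the question.
begin
  ivar_value.xpath('//productData/description')
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end
```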
The basic test is to use:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<productData>
<description>Chocolate biscuits </description>
</productData>
EOT
# using XPath selectors...
doc.xpath('//productData/description').to_html # => "<description>Chocolate biscuits </description>"
doc.xpath('//description').to_html # => "<description>Chocolate biscuits </description>"
xpath works fine when the document is parsed correctly.
I get an error "undefined method 'xpath' for nil:NilClass (NoMethodError)"
Usually this means you didn't parse the document correctly. In your case it's because you're not using the right variable:
gdsn_doc = GDSNDoc.new(xml_file)
...
@desc_exists = @gdsn_doc.xpath("//productData/description")
Note that gdsn_doc is not the same as @gdsn_doc. The latter doesn't appear to have been initialized.
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
While that should work, it's idiomatic to write it as:
@doc = Nokogiri::XML(File.read(xml_file))
File.open(...) do ... end is preferred if you're processing inside the block and want Ruby to automatically close the file. That isn't necessary when you're simply reading then passing the content to something else for processing, hence the use of File.read(...) which slurps the file. (Slurping isn't necessarily a good practice because it can have scalability problems, but for reasonably sized XML/HTML it's OK because it's easier to use DOM-based parsing than SAX.)
If Nokogiri doesn't raise an exception it was able to parse the content, however that still doesn't mean the content was valid. It's a good idea to check
@doc.errors
to see whether Nokogiri/libXML had to do some fix-ups on the content just to be able to parse it. Fixing the markup can change the DOM from what you expect, making it impossible to find a tag based on your assumptions for the selector. You could use xmllint or one of the XML validators to check, but Nokogiri will still have to be happy.
Nokogiri includes a command-line tool, nokogiri, that accepts a URL to the document you want to parse:
nokogiri http://example.com
It'll open IRB with the content loaded and ready for you to poke at it. It's very convenient when debugging and testing. It's also a decent way to make sure the content actually exists if you're dealing with HTML containing DHTML that loads parts of the page dynamically.
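On the "blank" half of the question: xpath returns a NodeSet that is empty only when no element matches, so a description element containing nothing but whitespace still passes an empty? check. One way to test both conditions is sketched below. FakeElement is a hypothetical stand-in for a Nokogiri node so the snippet runs without the gem, and description_present? is an illustrative name, not part of Nokogiri.

```ruby
# Hypothetical stand-in for a Nokogiri node so the sketch runs without the gem.
FakeElement = Struct.new(:text)

# True only when the node exists and its text is more than whitespace.
# A missing element is modelled as nil, which is what at_xpath returns
# when nothing matches.
def description_present?(node)
  !node.nil? && !node.text.strip.empty?
end

p description_present?(FakeElement.new('Chocolate biscuits '))
p description_present?(FakeElement.new('   '))
p description_present?(nil)
```

With a real document the argument would be something like doc.at_xpath('//productData/description'), assuming doc is the parsed Nokogiri document.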

Foreach loop in XML generator not breaking

I am trying to generate XML, but the loop isn't breaking. Here is a part of the code:
@key = 0
@cont.each do |pr|
  xml.product {
    @key += 1
    puts @key.to_s
    begin
      @main = Nokogiri::HTML(open(@url+pr['href'], "User-Agent" => "Ruby/#{RUBY_VERSION}","From" => "foo@bar.invalid", "Referer" => "http://www.ruby-lang.org/"))
    rescue
      puts "rescue"
      next
    end
    puts pr['href']
    puts @key.to_s
    break # this break doesn't work
    #something else
  }
end
Most interesting is that in the final generated XML file, break worked. The file contains only one product, but on the console @key was printed fully, which means the foreach loop doesn't break.
Could it be a Nokogiri XML-specific error, because of open brackets in the head of the loop?
In general, I think your approach to generating the XML is confused. Don't convolute your code any more than necessary: instead of starting to generate some XML and then aborting it inside the block because you can't find the page you want, grab the pages you want first, then start processing.
I'd move the begin/rescue block outside the XML generation. Its existence inside the XML generation block results in poor logic and questionable practices of using next and break. Instead I'd recommend something like this untested code:
@main = []
@cont.each do |pr|
  begin
    @main << Nokogiri::HTML(
      open(@url + pr['href'])
    )
  rescue
    puts 'rescue'
    next
  end
end
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root {
    xml.products {
      @main.each do |m|
        xml.product {
          xml.id_ m.at('id').text
          xml.name m.at('name').text
        }
      end
    }
  }
end
puts builder.to_xml
Which makes it easy to see that the code is keying off being able to retrieve a page.
This code is untested because we have no idea what your input values are or what your output should look like. Having valid input, expected output and a working example of your code that demonstrates the problem is essential if you want help debugging a problem with your code.
The use of @url + pr['href'] isn't generally a good idea. Instead use the URI class to build up the URL for you. URI handles encoding and ensures the URI is valid.
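As a rough illustration of the URI suggestion, URI.join from the standard library resolves a link against a base URL instead of concatenating strings. The base URL and hrefs below are made-up values standing in for @url and pr['href'], and absolute_url is a hypothetical helper name.

```ruby
require 'uri'

# Resolve an href against a base URL using the standard URI rules.
# URI.join raises URI::InvalidURIError when either part is not a valid URI.
def absolute_url(base_url, href)
  URI.join(base_url, href).to_s
end

# Made-up values standing in for @url and pr['href'] from the question.
base_url = 'http://example.com/products/'
hrefs = ['widget.html', '/about', 'item.html?id=7']

hrefs.each do |href|
  puts absolute_url(base_url, href)
end
```

Relative links are resolved under the base path, while an href starting with / replaces the whole path, which plain string concatenation would get wrong.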

Insert HAML into a Sinatra helper

I'm writing a helper for a small Sinatra app that prints some playing cards stored as hashes in an array.
Every card has this structure:
{ card: 'Ace', suit: :spades, value: 11 }
and the filename of the card image is "spades_11.jpg".
I'm writing a helper to display the cards in my view:
def view(hand)
  hand.each do |card|
    #print the card
  end
end
I need an output like this:
.span2
  %img(src="/images/#{card[:suit]}_#{card[:value]}")
How can I insert my Haml code inside the helper block keeping the indentation?
The simplest solution would be to just return the HTML directly from your helper as a string:
def view(hand)
  hand.map do |card|
    "<div class='span2'><img src='/images/#{card[:suit]}_#{card[:value]}'></div>"
  end.join
end
Then call it from your Haml with something like:
= view(@the_hand)
You could make use of the haml_tag helper which would let you write something like:
def view(hand)
  hand.each do |card|
    haml_tag '.span2' do
      haml_tag :img, 'src' => "/images/#{card[:suit]}_#{card[:value]}"
    end
  end
end
Note that haml_tag writes directly to the output rather than returning a string, so you would have to use it with - rather than =:
- view(@the_hand)
or use capture_haml.
This method means your helper depends on Haml. The first method would be usable whatever template language you used, but wouldn’t respect settings like format for whether to end the img tag with />.
If you want to use pure Haml for the markup for each card (this example is simple enough to get away with helpers, but you would certainly want to do this for more complex sections) you could use a partial. Add your Haml code to a file named e.g. view.haml, then you can render it from the containing template, passing in the hand as a local variable:
view.haml:
- hand.each do |card|
  .span2
    %img(src="/images/#{card[:suit]}_#{card[:value]}")
Parent template:
= haml :view, :locals => {:hand => @the_hand}
You should be able to use a here doc:
def view(hand)
  hand.each do |card|
    <<-HAML
.span2
  %img(src="/images/#{card[:suit]}_#{card[:value]}")
    HAML
  end
end
but note that here docs take the whitespace from the start of the line they are on, so unfortunately this will make your indentation somewhat ugly.
For anything more complicated it probably makes sense to write your haml in a separate .haml file.
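On the here doc indentation point: Ruby 2.3 added the "squiggly" here doc (<<~), which strips the common leading whitespace, so the template text can be indented along with the surrounding method. The sketch below only builds the Haml source as a string; card_haml is a hypothetical helper name, and the string would still need to be rendered by Haml.

```ruby
# Returns Haml source for one card. <<~ removes the indentation shared by
# all lines, so the nested %img keeps only its two relative spaces.
def card_haml(card)
  <<~HAML
    .span2
      %img(src="/images/#{card[:suit]}_#{card[:value]}")
  HAML
end

puts card_haml({ card: 'Ace', suit: :spades, value: 11 })
```

Because Haml nesting depends on relative indentation, stripping only the shared prefix keeps %img nested under .span2.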

Element not found in the cache - perhaps the page has changed since it was looked up in Selenium Ruby web driver?

I am trying to write a crawler that crawls all links from a loaded page and logs all request and response headers, along with the response body, to a file (say XML or txt). I am opening all links from the first loaded page in new browser windows so that I won't get this error:
Element not found in the cache - perhaps the page has changed since it was looked up
I want to know an alternative way to make requests and receive responses from all links, and then locate input elements and submit buttons from all opened windows.
I am able to do the above to some extent, except when an opened window has a common site search box like the one in the upper right corner of http://www.testfire.net.
What I want is to omit such common boxes so that I can fill the other inputs with values using the i.send_keys "value" method of WebDriver and don't get this error:
ERROR: Element not found in the cache - perhaps the page has changed since it was looked up.
How can I detect and distinguish input tags in each opened window so that values are not filled repeatedly into common input tags that appear on most pages of the website?
My code is following:
require 'rubygems'
require 'selenium-webdriver'
require 'timeout'
class Clicker
  def open_new_window(url)
    @driver = Selenium::WebDriver.for :firefox
    @url = @driver.get " http://test.acunetix.com "
    @link = Array.new(@driver.find_elements(:tag_name, "a"))
    @windows = Array.new(@driver.window_handles())
    @link.each do |a|
      a = @driver.execute_script("var d=document,a=d.createElement('a');a.target='_blank';a.href=arguments[0];a.innerHTML='.';d.body.appendChild(a);return a", a)
      a.click
    end
    i = @driver.window_handles
    i[0..i.length].each do |handle|
      @driver.switch_to().window(handle)
      puts @driver.current_url()
      inputs = Array.new(@driver.find_elements(:tag_name, 'input'))
      forms = Array.new(@driver.find_elements(:tag_name, 'form'))
      inputs.each do |i|
        begin
          i.send_keys "value"
          puts i.class
          i.submit
        rescue Timeout::Error => exc
          puts "ERROR: #{exc.message}"
        rescue Errno::ETIMEDOUT => exc
          puts "ERROR: #{exc.message}"
        rescue Exception => exc
          puts "ERROR: #{exc.message}"
        end
      end
      forms.each do |j|
        begin
          j.send_keys "value"
          j.submit
        rescue Timeout::Error => exc
          puts "ERROR: #{exc.message}"
        rescue Errno::ETIMEDOUT => exc
          puts "ERROR: #{exc.message}"
        rescue Exception => exc
          puts "ERROR: #{exc.message}"
        end
      end
    end
    #Switch back to the original window
    @driver.switch_to().window(i[0])
  end
end
ol = Clicker.new
url = ""
ol.open_new_window(url)
How can I get all request and response headers, along with the response body, using Selenium WebDriver or http.set_debug_output of Ruby's net/http?
Selenium is not one of the best options for building a "web crawler". It can be too flaky at times, especially when it comes across unexpected scenarios. Selenium WebDriver is a great tool for automating and testing expectations and user interactions.
Instead, good old-fashioned curl would probably be a better option for web crawling. Also, I am pretty sure there are some Ruby gems that might help you crawl the web; just search for them!
But to answer the actual question, if you were to use Selenium WebDriver:
I'd work out a filtering algorithm where you add the HTML of each element that you interact with to an array. Then, when you go on to the next window/tab/link, it would check against that array and skip the element if it finds a matching HTML value.
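The filtering idea could look roughly like the sketch below. It uses a Set rather than an array so the membership check stays cheap, and plain strings stand in for element markup; SeenElements and first_visit? are hypothetical names. With WebDriver the markup might be read with something like element.attribute('outerHTML'), which is an assumption and not part of the original answer.

```ruby
require 'set'

# Tracks the markup of elements already handled so that widgets repeated
# on every page (such as a site-wide search box) are only filled once.
class SeenElements
  def initialize
    @seen = Set.new
  end

  # Returns true the first time a given markup string is offered,
  # false on every later occurrence (Set#add? returns nil for duplicates).
  def first_visit?(outer_html)
    !@seen.add?(outer_html).nil?
  end
end

seen = SeenElements.new
# Made-up markup for two pages that share a search box.
page_one = ['<input name="search">', '<input name="login">']
page_two = ['<input name="search">', '<input name="comment">']

to_fill = (page_one + page_two).select { |html| seen.first_visit?(html) }
puts to_fill
```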
Unfortunately, SWD does not support getting request headers and responses with its API. The common work-around is to use a third party proxy to intercept the requests.
============
Now I'd like to address a few issues with your code.
I'd suggest that, before iterating over the links, you add @default_current_window = @driver.window_handle. This will allow you to always return to the correct window at the end of your script by calling @driver.switch_to.window(@default_current_window).
In your @link iterator, instead of iterating over all the possible windows that could be displayed, use @driver.switch_to.window(@driver.window_handles.last). This will switch to the most recently displayed new window (and it only needs to happen once per link click!).
You can DRY up your inputs and form code by doing something like this:
inputs = []
inputs << @driver.find_elements(:tag_name => "input")
inputs << @driver.find_elements(:tag_name => "form")
inputs.flatten! # flatten! modifies the array in place; plain flatten would discard its result
inputs.each do |i|
  begin
    i.send_keys "value"
    i.submit
  rescue => e
    puts "ERROR: #{e.message}"
  end
end
Please note how I just added all of the elements you wanted SWD to find into a single array variable that you iterate over. Then, when something bad happens, a single rescue is needed (I assume you don't want to automatically quit from there, which is why you just want to print the message to the screen).
Learning to DRY up your code and use external gems will help you achieve a lot of what you are trying to do, and at a faster pace.

Rails 3 and html_safe confusion (allow pictures (smiles) in chat but deny everything else)

I have a module that replaces smilies (like ":-)") with icons:
module Smileize
  PATH = "/images/smiles"
  SMILES = [/\;\-?p/i, /\$\-?\)/, /8\-?\)/, /\>\:\-?\(/, /\:\-?\*/, /\:\-?o/i, /\:\-?c/i, /\;\-?\)/,
            /\:\-?s/i, /\:\-?\|/, /\:\-?p/i, /\:\-?D/i, /\:\-?\?/, /\:\-?\(/, /\:\-?\)/]
  def to_icon(key)
    return "<img class='smiley' src='#{PATH}/smile#{SMILES.index(key) + 1}.png'/>"
  end
  module_function :to_icon
end
class String
  def to_smile
    Smileize::SMILES.each do |smile|
      if self =~ smile
        self.gsub!(smile, Smileize.to_icon(smile))
      end
    end
    self
  end
end
To make the pictures show, I'm using html_safe, like this:
<%= @message.text.to_smile.html_safe %>
But this does not suit me, because not only the pictures but any other tags in the message are rendered too.
My question is: how can I display only my smilies while ignoring other tags?
I think you'll need to do it like this:
HTML encode the string.
Perform your substitution.
Mark the final result as HTML safe.
Add a helper something like this:
def expand_smilies(s)
  s = ERB::Util::html_escape(s)
  Smileize::SMILES.each do |smile|
    s.gsub!(smile, Smileize.to_icon(smile))
  end
  s.html_safe
end
And then in your ERB:
<%= expand_smilies some_text %>
ERB uses ERB::Util::html_escape to encode HTML so using it yourself makes sense if you're targeting ERB. Calling html_safe on a string returns you something that ERB will leave alone when it is HTML encoding things.
Note that there is no usable html_safe! on strings and html_safe returns an ActiveSupport::SafeBuffer rather than a String so you'll have to use a helper rather than monkey patching a new method into String. ActiveSupport does patch an html_safe! method into String but all it does is raise an exception saying "don't do that":
def html_safe!
  raise "You can't call html_safe! on a String"
end
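One caveat with escaping the whole string first: html_escape turns > into &gt;, so a pattern such as the question's >:-( no longer matches afterwards, and the ; that ends an entity can pair with a following ) and look like a wink. A variation on the steps above (not the answer's exact method) is to split the raw text on the smilies and escape only the pieces in between. The sketch below is plain Ruby with a shortened, hypothetical smiley table; ICONS, SMILEY_RE and expand_smilies_plain are illustrative names.

```ruby
require 'erb'

# Hypothetical, shortened smiley table (the question's SMILES list is longer).
ICONS = {
  ':-)'  => 'smile1.png',
  ':-('  => 'smile2.png',
  '>:-(' => 'smile3.png'
}.freeze

# Longest keys first, wrapped in a capture group so that String#split
# keeps the matched smilies in its result.
SMILEY_RE = /(#{Regexp.union(ICONS.keys.sort_by { |k| -k.length })})/

# Escapes everything except the smilies, which become <img> tags.
def expand_smilies_plain(text)
  text.split(SMILEY_RE).map do |piece|
    if ICONS.key?(piece)
      "<img class='smiley' src='/images/smiles/#{ICONS[piece]}'/>"
    else
      ERB::Util.html_escape(piece)
    end
  end.join
end

puts expand_smilies_plain('<b>hi</b> :-)')
puts expand_smilies_plain('(a>) >:-(')
```

In a Rails helper the joined result would still need html_safe called on it, as in the answer.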
