Ruby, Cucumber and Watir Automation Scripting Basics - ruby

Thank y'all within the community and the moderators for being so cool and willing to help so quickly! Just wanted to lead in with that. I could really use some help with this basic automation script that I am running. I am trying to select the search bar on Google.com and enter some text. I have gotten some help from friends, but they were stuck as well. In learning this, I was hoping to get some help from the experts and just ask these questions that I have, because Google ain't got shit!
1) How to select the search field and enter text.
Mine looks something like this. I've tried xpath, different values, ids, and classes.
require 'ruby'
require 'watir-webdriver'
browser = Browser::browser.new :firefox
browser.goto 'http://google.com'
browser.text_field(:value => 'Search').set('google search')
2) When I inspect the element and find unique characteristics to that value (i.e. href, id, title, class, name), which are the ones that I can actually utilize to call either the button, text_field, or link?
3) I understand html and css pretty well. Can someone please explain how to properly utilize xpath?
Y'all rock, I feel like there are tons of people out there who have these same questions as I do, and can't find the damn answers anywhere, so I ask all of you automation experts, would you mind dropping a knowledge bomb and learnin us?

There are a number of errors in this script. Here's a working version:
# require 'ruby' # don't require ruby
require 'watir-webdriver' # corrected typo in gem name
browser = Watir::Browser.new :firefox # corrected Browser::browser.new
browser.goto 'http://google.com'
browser.text_field(:title => 'Search').set('google search') # changed :value to :title
In terms of identifying page elements, it's generally considered good practice to use the id attribute since it's unique to the page. You can use the attributes that you've listed, but they have to exist as attributes for the given HTML element. AFAIK, watir-webdriver supports using the majority of standard HTML tag attributes for location and can also locate elements based on their index, via regular expression, and by combining multiple locators. For example, you could substitute any of these in the script above:
browser.text_field(:title => /Search/).set('google search')
browser.text_field(:class => 'gsfi').set('google search')
browser.text_field(:id => 'lst-ib').set('google search')
browser.text_field(:name => 'q', :class => 'gsfi').set('google search')
If you haven't already, I'd suggest checking out http://watirwebdriver.com/ and https://github.com/watir/watir/wiki. And if you're curious about using xpath to find tricky elements, check out https://jkotests.wordpress.com/2012/08/28/locate-element-via-custom-attribute-css-and-xpath/.

Related

Accessing and scraping sporadically available Wikipedia sections

I need to fetch some data but I'm completely stumped after trying a few things.
I want to access Airlines & Destinations from the Albuquerque_International_Sunport's wiki page - keep in mind, I'll be going through a prepopulated list of airports with this data.
There are multiple "types" of Airlines: Passenger, Cargo, and sometimes other (sub?)sections; other times there are none.
Articles for multiple airports will be accessed automatically - including some less known airports. This means I need to:
Check if "Airlines & Destinations" section exists
Take all data inside of any table
Scrape it; otherwise do nothing
I've tried using the Ruby wikipedia-client gem; however, the .raw_data method isn't even returning the section data.
Next, I went to Wikipedia's API. Unless I am mistaken, it doesn't return "section" names! This doesn't seem right, but I wasn't able to get it working.
So I suppose that leaves Nokogiri. I can grab and parse the pages fine, but:
How would I go about detecting the presence of the "Airlines & Destinations" section and getting all table data BEFORE the end of the section? I have a suspicion I need some tricky XPath for this.
Seems to be the only viable solution.
Any thoughts welcome. Putting a bounty on this question when I can.
Edit: Perhaps it's better to simply somehow grab a list of all airlines in the world and hit them against HTML? Seems like it could be computationally expensive.
Well, I'm not an expert user of Nokogiri but maybe this can give you some idea.
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(URI.open("https://en.wikipedia.org/wiki/Albuquerque_International_Sunport"))
# this is the passenger table
page.xpath('//*[@id="mw-content-text"]/div/table[2]/tr').each do |tr|
  p tr.text
  puts "-" * 50
end
# this is the cargo table
page.xpath('//*[@id="mw-content-text"]/div/table[3]/tr').each do |tr|
  p tr.text
  puts "-" * 50
end

How can I scrape the images from a sub-reddit?

Given a subreddit like /r/pics, how can I scrape all the images in Ruby?
I looked through Reddit's API, but there doesn't seem to be anything for this. But a site like "redditery" is already doing this - http://www.redditery.com/r/aww
Check out Nokogiri; it should be able to perform this task.
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://www.reddit.com/r/aww"))
doc.css('div#siteTable').css('a').each {|x| puts x['href']}
That should output links to images (This code isn't tested but should be pretty close)
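Note that the selector above returns every link inside the listing, not just images, so some filtering is probably needed. A minimal sketch of one way to do that, keeping only hrefs whose path ends in a common image extension, is below. The helper name image_link?, the extension list, and the sample URLs are my own assumptions.

```ruby
require 'uri'

# True when the href's path ends in one of the given image extensions.
# Anything that cannot be parsed as a URI is treated as "not an image".
def image_link?(href, extensions = %w[.jpg .jpeg .png .gif])
  path = URI.parse(href).path.to_s
  extensions.include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  false
end

# Made-up sample of what x['href'] might return, including a missing href.
hrefs = [
  'http://i.imgur.com/abc123.jpg',
  'http://www.reddit.com/r/aww/comments/xyz/',
  'https://example.com/photo.PNG?size=large',
  'not a url at all',
  nil
]

images = hrefs.compact.select { |href| image_link?(href) }
puts images
```

Checking the extension is only a heuristic: links to gallery pages or image hosts without a file extension would be skipped.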

Getting HTML table values using XPath in Nokogiri?

I'm trying to get some values from a table using the XPath of this table but it only returns [] (empty):
require 'nokogiri'
require 'open-uri'
url = "http://riopretrans.com.br/linhas.php?ln=106"
doc = Nokogiri::HTML(URI.open(url))
doc.xpath("html/body/table[1]/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/div/table[1]/tbody/tr[3]/td/div/div/center/font/table").each do |lines|
puts lines.content
end
I found the table's XPath using Firebug so I think it's correct.
Can anyone help me?
Remove tbody/ from your XPath.
The tbody tag is part of the HTML spec for table tags, but it's rarely actually written in the HTML. Some browsers insert it when they build the DOM, even though it's not in the HTML for the page. Firebug then shows it, you see it, and assume it must be there.
Even using "view source" can confuse you, because you expect that to be accurate, but the browser has already munged the content to include "tbody", so, well, basically they're lying to you.
You can confirm this by looking at the HTML that Nokogiri is getting. Use puts doc.to_html['tbody'] and see if you get "tbody" or nil.
...Because in html file all of them were specified (written by programmer)
If you are positive they actually belong there, because they exist in the HTML source, then you'll need to take apart your XPath. Start with a broad path, and slowly add to it to narrow down your search.
The server is unreachable for me right now, so I can't confirm that, or dig into what the hierarchy should be, and show an example. (That's why actually giving us REAL HTML in your question is SO much better than a link which might not work.)
An alternative is to use XPath's // (search anywhere) with a less restrictive path, or CSS selectors. Either way, actually examine the HTML, instead of relying on Firebug's XPath, and determine what "landmarks" you can use in the source to navigate to your desired table. Today's HTML is chock-full of id and class attributes, or particular series of tags, that act as a fingerprint for the table you want. Search for the minimum needed to pinpoint that table.
If the table is something like <table id="foo">, then use doc.at('table#foo'). If it's in a <div class="bar"><table> use doc.at('div.bar table'). In any case, use the smallest sized accessor necessary to get the job done. That will increase your chances of success if anything in the HTML changes in the future.

How do I write a web scraper in Ruby?

I would like to crawl a popular site (say Quora) that doesn't have an API and get some specific information and dump it into a file - say either a csv, .txt, or .html formatted nicely :)
E.g. return only a list of all the 'Bios' of the Users of Quora that have, listed in their publicly available information, the occupation 'UX designer'.
How would I do that in Ruby ?
I have a moderate enough level of understanding of how Ruby & Rails work. I just completed a Rails app - mainly all written by myself. But I am no guru by any stretch of the imagination.
I understand RegExs, etc.
Your best bet would be to use Mechanize. It can follow links, submit forms, anything you will need, web client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
If you want something more high level, try wombat, which is this gem I built on top of Mechanize and Nokogiri. It is able to parse pages and follow links using a really simple and high level DSL.
I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.
All you have to do is take a look at the html source of the pages and try to find a XPath or CSS expression that matches the desired elements, then use something like:
doc.search("//p[@class='posted']")
Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.
Nokogiri is great, but I find the output messy to work with. I wrote a ruby gem to easily create classes off HTML: https://github.com/jassa/hyper_api
The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.
E.g.
Post = HyperAPI.new_class do
  string title: 'div#title'
  string body: 'div#body'
  string author: '#details .author'
  integer comments_count: '#extra .comment' do
    size
  end
end
# => Post
post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>

Generate summary of an url like facebook in Ruby

Is there any gem in Ruby to generate a summary of a URL, similar to what Facebook does when you post a link?
None that I'm aware of, but it shouldn't be too hard to roll your own. In the simplest case you can just require 'open-uri' and then use the URI.open method to retrieve the contents of the site, or go for one of the HTTP libraries.
Once you have the document, all you have to do is use something like Nokogiri or Hpricot to get the title, the first paragraph of text, and an image, and you are done.
Generating a thumbnail isn't a straightforward task. The page has to be rendered, the window captured, shrunk down, then stored or returned. While it would be possible for a gem to do it, there would be significant overhead.
There are websites that can create the thumbnails, then you can reference the image:
Websnapr
Webthumb
ShrinkTheWeb
iWEBTOOL
I haven't tried them, but there's a good page discussing the first two on The Accidental Technologist.
If you need some text from the page, it's simple to grab some, but making it sensible is a different problem:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
page_text = doc.text
print page_text.gsub(/\s+/, ' ').squeeze(' ')[0..99]
# >> IANA — Example domains Domains Numbers Protocols About IANA Example Domains As described in RFC 2606
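One small step towards more sensible text is to avoid cutting a word in half. The sketch below is plain Ruby with no gems; the method name excerpt and the trailing "..." are my own choices, and it does nothing about picking meaningful content, which is the harder part.

```ruby
# Collapses whitespace, then truncates to at most `limit` characters,
# backing up to the last space inside the limit so no word is split.
def excerpt(text, limit = 100)
  clean = text.gsub(/\s+/, ' ').strip
  return clean if clean.length <= limit

  cut = clean[0, limit]
  last_space = cut.rindex(' ')
  cut = cut[0, last_space] if last_space
  "#{cut}..."
end

puts excerpt("Example   Domains\nAs described in RFC 2606", 20)
puts excerpt('short text')
```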
