Extracting a Link using Mechanize in Ruby

I'm trying to extract a link from an element (.jobtitle a) using mechanize. I'm trying to do that in the link variable below. Anyone know how?
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://id.indeed.com/')
indeed_form = page.form('jobsearch')
indeed_form.q = ''
indeed_form.l = 'Indonesia'
page = agent.submit(indeed_form)
page.search(".row , .jobtitle a").each do |job|
  job_title = job.search(".jobtitle a").map(&:text).map(&:strip)
  company = job.search(".company span").map(&:text).map(&:strip)
  date = job.search(".date").map(&:text).map(&:strip)
  location = job.search(".location span").map(&:text).map(&:strip)
  summary = job.search(".summary").map(&:text).map(&:strip)
  link = job.search(".jobtitle a").map(&:text).map(&:strip)
end

I don't think you can select attributes with CSS paths.
From the mechanize documentation:
search()
Search for paths in the page using Nokogiri's search. The paths can be XPath or CSS and an optional Hash of namespaces may be appended.
See Nokogiri::XML::Node#search for further details.
You should check out XPaths instead. See e.g.:
Getting attribute using XPath
http://www.w3schools.com/xpath/
You may need to rewrite the way you iterate through the page.

Related

Searching by text with Mechanize/Nokogiri

I'm trying to scrape some data on average GPA and more from a lot of pages similar to this one:
http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers
My issue is that gpa_headers is nil but there is at least one h3 element containing "GPA".
What could be causing this issue? I thought Mechanize might have trouble with the page's dynamic elements, yet when I puts page.body the output includes:
... <h3 style="text-align:center;">GPA REQUIREMENT</h3> ...
Which, by my understanding should be found with the xpath I used.
If there is a better approach to this I would like to know that too.
This looks to be a problem with the DOM structure of the site, as it contains a tag named style which isn't being closed and looks like this:
<td colspan='7'><style='text-align:center;font-style:italic'>The
institution has been granted Candidate for Accreditation status by the
Commission on Accreditation in Physical Therapy Education (1111 North
Fairfax Street, Alexandria, VA, 22314; phone: 703.706.3245; email: <a
href='mailto:accreditation@apta.org'>accreditation@apta.org</a>).
Candidacy is not an accreditation status nor does it assure eventual
accreditation. Candidate for Accreditation is a pre-accreditation
status of affiliation with the Commission on Accreditation in Physical
Therapy Education that indicates the program is progressing toward
accreditation.</td>
As you can see, the td tag closes but the inner style tag never does.
If you don't need this part of the markup, I would recommend removing it before working with the rest of the response. I don't have experience with Ruby, but I would do something like:
1. Get the raw body of the response.
2. Replace the part that matches the regex '(<style=\'.*)</td>' with an empty string, or close the tag yourself.
3. Work with this new response body.
Now you should be able to work with XPath selectors.
eLRuLL gives the source of the problem above. Here is an example of how I fixed the issue:
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
mangled_text = page.body
fixed_text = mangled_text.sub(/<style=.+?<\/td>/, "</td>")
page = Nokogiri::HTML(fixed_text)
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers
This will return the header that I was looking for above:
[#<Nokogiri::XML::Element:0x2b28a8ec0c38 name="h3" attributes=[#<Nokogiri::XML::Attr:0x2b28a8ec0bc0 name="style" value="text-align:center;">] children=[#<Nokogiri::XML::Text:0x2b28a8ec0774 "GPA REQUIREMENT">]>]
A more reliable solution is to work with an HTML5 parser like nokogumbo:
require 'nokogumbo'
doc = Nokogiri::HTML5(page.body)
gpa_headers = doc.search('//h3[contains(text(), "GPA")]')

Posting data on website using Mechanize Nokogiri Selenium

I need to post data to a website programmatically.
To achieve this I am using Mechanize, Nokogiri and Selenium.
Here's my code:
def aeiexport
  # First, Mechanize submits the login form to identify yourself on the website
  agent = Mechanize.new
  agent.get("https://www.glou.com")
  form_login_AEI = agent.page.forms.first
  form_login_AEI.util_vlogin = "42"
  form_login_AEI.util_vpassword = "666"
  # This is supposed to submit the form, I think
  page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)
  # To be able to scrape the page you end up on after submitting the form
  body = page_compet_list.body
  html_body = Nokogiri::HTML(body)
  # tds gives back an array of td
  tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")
  # Check my array of td against a condition
  tds.each do |td|
    link = td.children.first # Select the first child
    if link.html = "2015 32 92 0076 012"
      # Only consider the html part of the link; if matched, follow the previous link
      previous_td = td.previous
      previous_url = previous_td.children.first.href
      # Follow the link contained in previous_url
      page_selected_compet = agent.get(previous_url)
      # To be able to scrape the page I end up on
      body = page_selected_compet.body
      html_body = Nokogiri::HTML(body)
      joueur_access = html_body.search('#tabs0head2 a')
      # Click on the link
      joueur_access.click
      rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
      pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence['href']
      # Follow pure_link_rechercher_par_numéro_de_licence
      page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
      body_submit_licence = page_submit_licence.body
      html_body = Nokogiri::HTML(body_submit_licence)
      # Post my data in the right field
      form.field_with(:name => 'lic_cno[0]') == "9511681"
1) What do you think of this code so far? Do you see an error in it?
2) This is the part I am really not sure about: I have put my data in the right field, but now I need to submit it. The problem is that the button I need to click looks like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
It triggers a JavaScript function on click. I am trying to use Selenium to trigger the click event. I then end up on another page, where I need to click a few more times. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me whether Selenium will let me do what I need to do, and if so, how?
At a glance, your code could use less indentation and more white space/empty lines to separate the internal logic of AEIexport (which should be renamed aei_export, since Ruby uses snake case for method names). Ruby style guides have more recommendations on how to style Ruby code.
Besides the style of your code, an error I found at the beginning of your method is using an undefined variable page when defining form_login_AEI.
For your second question, I'm not familiar with Selenium; however since it does use a real web browser it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (e.g. in Firebug) and understand what the JavaScript on the page does, then use Mechanize to make the same request manually.

How to count the number of images on a certain page using Mechanize?

I'm using Mechanize in a Rails 4 application. I created a new agent to scrape a page:
clienturl = @bid.mozs.where(is_main: true).first.attributes['url']
agent = Mechanize.new
@page = agent.get('http://' + clienturl)
@url = @page.uri
I can do things like get the uri, title and meta description. I'd like to now get the count of images on the page and how many of those images are missing alt attributes. Is this possible with Mechanize?
Do something like this:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.iana.org/domains/reserved')
doc = page.parser
img_count = doc.search('img').size # => 2
img_w_alt_count = doc.search('img[@alt]').size # => 1
img_count - img_w_alt_count # => 1
Nokogiri is the parser inside Mechanize. parser returns an instance of the parsed DOM. From that we can ask Nokogiri to search for all nodes matching a selector. I used a CSS selector, but you can use XPath also; CSS tends to be more readable and less verbose.
search returns a NodeSet, so size tells us how many nodes matched.

Ruby newb: how do I extract a substring?

I'm trying to scrape a CBS sports page for shot data in the NBA.
Here is the page I'm starting out with and using as a sample: http://www.cbssports.com/nba/gametracker/shotchart/NBA_20131115_MIL@IND
In the source, I found a string that contains all the data that I need. This string, in the webpage source code, is directly under var CurrentShotData = new.
What I want is to turn this string in the source into a string I can use in ruby. However, I'm having some trouble with the syntax. Here's what I have.
require 'nokogiri'
require 'mechanize'
a = Mechanize.new
a.get('http://www.cbssports.com/nba/gametracker/shotchart/NBA_20131114_HOU@NY') do |page|
shotdata = page.body.match(/var currentShotData = new String\(\"(.*)\"\)\; var playerDataHomeString/m)[1].strip
print shotdata
end
I know I must be doing this wrong... it seems so needlessly complex and on top of that it isn't working for me. Could someone enlighten me on the simple way to get this string into Ruby?
Try to replace:
shotdata = page.body.match(/var currentShotData = new String\(\"(.*)\"\)\; var playerDataHomeString/m)[1].strip
with:
shotdata = page.body.match(/var currentShotData = new String\(\"(.*?)\"\)\; var playerDataHomeString/m)[1].strip
Changing (.*) to (.*?) makes the quantifier lazy (non-greedy), so it matches as few characters as possible and stops at the first closing "), which is the behavior you want.
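To see the difference between the two quantifiers without fetching the page, here is a minimal sketch on a made-up source string. The string only imitates the shape described in the question, and the pattern is shortened to the part that matters. It uses String#[] with a capture index, which returns nil instead of raising when nothing matches.

```ruby
# Hypothetical stand-in for the page source; the real data is much longer.
source = 'var currentShotData = new String("1,2~3,4"); ' \
         'var playerDataHomeString = new String("home");'

# Greedy: .* runs to the last "\")" in the string, swallowing the next variable.
greedy = source[/var currentShotData = new String\("(.*)"\)/m, 1]

# Lazy: .*? stops at the first "\")" after the opening quote.
lazy = source[/var currentShotData = new String\("(.*?)"\)/m, 1]

puts "greedy: #{greedy}"
puts "lazy:   #{lazy}"
```

The greedy capture runs on into the playerDataHomeString declaration, while the lazy one holds only the shot data.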

Using mechanize to check for div with similar but different names

Currently I'm doing the following:
if( firstTemp == true )
  total = doc.xpath("//div[@class='pricing condense']").text
else
  total = doc.xpath("//div[@class='pricing ']").text
end
I'm wondering, is there any way to get Mechanize to automatically fetch divs whose class contains the string "pricing"?
Is doc a Mechanize::Page? Usually the convention is page for those and doc for Nokogiri::HTML::Document. Anyway, for either one try:
doc.search('div.pricing')
For just the first one, use at instead of search:
doc.at('div.pricing')
