Scraping Coursera results in 404 - ruby

Why would the following result in a 404?
require 'rubygems'
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'
class CourseraScraper
  include Capybara::DSL
  def initialize
    Capybara.default_driver = :poltergeist
    Capybara.run_server = false
    Capybara.app_host = "https://www.coursera.org/"
    visit '/'
    save_and_open_page
  end
end
CourseraScraper.new

You're not getting a 404 from the site itself. The 404 only appears once the page has been saved to a file and opened in your browser. As a guess, it is triggered by some JS that is either loaded from the wrong referrer or not loaded at all because of the referrer.
You can see this by adding assert_text("Take the world's best courses, online.") to the bottom of your test, which passes just fine because Poltergeist is working with the normal coursera.org page.
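To illustrate the guess above: a page saved with save_and_open_page is opened from a local file, so a root-relative asset path in it resolves against file:// rather than against coursera.org. The sketch below uses only Ruby's standard uri library. The asset path and the saved file name are made-up placeholders, not values taken from Coursera or Capybara.

```ruby
require 'uri'

# Resolve a reference found in a page against the location the page was loaded from.
def resolve_asset(page_location, asset_ref)
  URI.join(page_location, asset_ref).to_s
end

# Hypothetical values for illustration only.
LIVE_PAGE  = 'https://www.coursera.org/'
SAVED_PAGE = 'file:///tmp/capybara-saved-page.html'
ASSET_REF  = '/static/app.js'

puts resolve_asset(LIVE_PAGE, ASSET_REF)   # resolved against the live site
puts resolve_asset(SAVED_PAGE, ASSET_REF)  # resolved against the local file system
```

Whether this is what actually produces the 404 here is still a guess; the sketch only shows why a saved copy can behave differently from the page Poltergeist sees.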

I wonder if there's a redirect implemented if you don't have the right referral data. When I run your code, I briefly see the site load before being taken to the 404.
If instead I visit a bad URL such as https://www.coursera.org/badurl, I don't get a 404 page at all but instead a message saying "Sorry, the class you were looking for cannot be found. Please check your URL and try again."

Related

Can mechanize work with browsers?

I am using Ruby's mechanize gem to automate a file upload after logging in to a particular site.
I am able to log in using:
#!/usr/bin/ruby
require 'rubygems'
require 'mechanize'
# Create a Mechanize agent
a = Mechanize.new { |agent|
  # site refreshes after login
  agent.follow_meta_refresh = true
}
# Get the login page
a.get('https://www.samplesite.com/') do |page|
  puts page.title
  form = page.forms.first
  form.fields.each { |f| puts f.name }
  form['username'] = "username"
  form['password'] = "password"
  # Then submit the form to reach the next page
  form.submit
end
Now I have two questions:
a. Can I watch this happening in a browser, using any agent or tool?
b. Is there any way to make Mechanize wait for the page to load?
Have you tried Selenium WebDriver?
It should integrate easily with your Ruby program.
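On the waiting question: browser-driving tools usually offer an explicit wait that polls a condition until it holds or a timeout expires (Selenium's Ruby bindings ship Selenium::WebDriver::Wait for this). A minimal plain-Ruby sketch of that idea follows. wait_until and its parameters are hypothetical names, not part of Mechanize or Selenium, and the counter is only a stand-in for a real "is the page ready?" check.

```ruby
# Poll the block until it returns a truthy value or the timeout expires.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for "has the page finished loading?": becomes true on the third check.
checks = 0
wait_until(timeout: 2, interval: 0.01) { (checks += 1) >= 3 }
puts "condition met after #{checks} checks"
```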

Trouble logging in to Pinterest with ruby mechanize

I am trying to build a simple crawler that can log in to Pinterest and pin a few things to my board.
The first step is logging in successfully. I read through the documentation and it seems like this should work, but it doesn't.
When I run the code I expect it to print out a title like "Mary... is mary... on Pinterest"
But instead the title of the page is "Pinterest-The Visual Discovery Tool"
I think there's something wrong with my script.
require 'rubygems'
require 'mechanize'
require 'pry'
a = Mechanize.new
a.get('https://www.pinterest.com/login/') do |page|
  form = page.forms.first
  form.fields[0].value = "m...#gmail.com"
  form.fields[1].value = "some_password"
  new_page = form.submit
  puts new_page.title
end
Keep in mind that Mechanize cannot execute JavaScript, so a page that depends on JavaScript may not load correctly. Although I only skimmed the page source, the site looks heavily dependent on JavaScript and therefore can't be crawled effectively with Mechanize.
Another option might be to drive a real or headless browser with a tool like Watir or Selenium.
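A rough way to check the point above before writing a crawler is to look at the HTML the server sends and see whether the login form is there at all, or only script tags that would build it in a browser. The sketch below uses plain string scanning on two made-up HTML snippets. The heuristic and both helper names are assumptions for illustration, not Mechanize features, and a real page would call for a proper HTML parser.

```ruby
# Summarize what a non-JavaScript client would see in the raw HTML.
def static_summary(html)
  {
    forms:   html.scan(/<form\b/i).size,
    inputs:  html.scan(/<input\b/i).size,
    scripts: html.scan(/<script\b/i).size
  }
end

# True when the markup has scripts but no form for a client like Mechanize to fill in.
def likely_needs_javascript?(html)
  summary = static_summary(html)
  summary[:forms].zero? && summary[:scripts].positive?
end

# Made-up examples: one server-rendered form, one page assembled by JavaScript.
STATIC_LOGIN = '<html><body><form action="/login"><input name="email"><input name="password"></form></body></html>'
SCRIPTED_LOGIN = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'

puts likely_needs_javascript?(STATIC_LOGIN)
puts likely_needs_javascript?(SCRIPTED_LOGIN)
```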

Login to Vimeo Via Mechanize (ruby)

I am trying to log in to my Vimeo account using Mechanize in order to scrape hundreds of video titles and URLs. Here is my code:
task :import_list => :environment do
  require 'rubygems'
  require 'mechanize'
  agent = Mechanize.new
  agent.user_agent = "Mac Safari"
  puts "Logging in..."
  page = agent.get("http://vimeo.com/log_in")
  form = page.forms[0]
  form.fields[0].value = 'sample#email.com'
  form.fields[1].value = 'somepassword'
  page = agent.submit(form)
  pp page
end
and my error message:
401 => Net::HTTPUnauthorized
This is running through a rake task if it matters at all.
Any ideas?
Not sure how to do it with Mechanize, but here is code to do it with Capybara:
require 'capybara/dsl'
require 'selenium-webdriver'
Capybara.run_server = false
Capybara.default_driver = :selenium
class Vimeo
  include Capybara::DSL
  def go
    visit "https://vimeo.com/log_in"
    fill_in "email", :with => "sample#email.com"
    fill_in "password", :with => "somepassword"
    find("span.submit > input").click
  end
end
v = Vimeo.new
v.go
Also, Capybara is better suited to scraping JavaScript-heavy sites.
My first thought was: Vimeo login does not work without JavaScript, so it's not possible to log in with Mechanize.
To test my bold statement:
Without JavaScript: disable JavaScript for all sites in your browser, then try to log in (fill out the form in your browser like you normally do). You'll get an unauthorized message on the resulting page.
With JavaScript: enable JavaScript, and everything works as expected.
Update
Vimeo.com uses the following query string when logging in. I'm going to try posting the string manually with Mechanize.
action=login&service=vimeo&email=your-email&password=your-password&token=k7yd5du3L9aa5577bb0e8fc
Update 2
I've got a Ruby Rake task that logs in to a Vimeo Pro account and reads the HTTP Live Streaming link from a video settings page.
Update 3
I've posted a working Ruby Rake task: https://gist.github.com/webdevotion/5635755.
Have you tried using the official Vimeo API?
It seems that authorization requires a 'token' value.
The relevant part of the HTTP request:
action=login&service=vimeo&email=your_mail&password=asfsdfsdf&token=51605c24c92a4d4706ecbe9ded7e3851
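Both login strings quoted above are ordinary form-encoded key/value pairs, and the token differs between them, so it is presumably issued by the site rather than fixed. A hedged sketch of building such a body with Ruby's standard library follows. The field names come from the strings above; login_body is a hypothetical helper, and the token value is a placeholder that would have to be obtained from the site first.

```ruby
require 'uri'

# Build the form-encoded login body; the token must come from the site.
def login_body(email:, password:, token:)
  URI.encode_www_form(
    action:   'login',
    service:  'vimeo',
    email:    email,
    password: password,
    token:    token
  )
end

# Placeholder values only; 'PLACEHOLDER_TOKEN' is not a real token.
body = login_body(email: 'your-email', password: 'your-password', token: 'PLACEHOLDER_TOKEN')
puts body

# Decoding shows the individual fields again.
p URI.decode_www_form(body).to_h
```

Using encode_www_form rather than string concatenation means characters such as @ or spaces in the email and password are escaped correctly.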

How to use function present in one file in another file in watir ruby

I am new to Ruby and need help accessing a function defined in another file. I have two files, let's say test.rb and functions.rb.
In test.rb I have the code below:
require 'rubygems'
require 'watir'
require 'win32ole'
require 'erb'
require 'ostruct'
require 'C:\functions'
include Watir
U_RL="some url"
browser
if ie.text.include?"There is a problem with this website's security certificate."
  ie.link(:id, 'overridelink').click
end
Now, in functions.rb I have the code below:
require 'rubygems'
require 'watir'
require 'win32ole'
include Watir
def browser
  ie=IE.new
  ie.maximize
  ie.goto U_RL
  ie.focus
  ie.bring_to_front
  ie.wait()
end
When I run test.rb, I get the error "undefined local variable or method 'ie' for main:Object".
I can see that the browser opens and even that the URL loads, but when the security warning page comes up, the ie.link(:id, 'overridelink').click line does not click the link.
Please let me know how to overcome this.
In your definition of the browser method, the scope of ie is local to that method. It cannot be accessed outside of it.
This code needs to be completely refactored, but for now you could just have browser return the local instance of ie and assign it in test.rb.
functions.rb:
def browser
  ie=IE.new
  ie.maximize
  ie.goto U_RL
  ie.focus
  ie.bring_to_front
  ie.wait()
  ie # last value is returned in ruby; can be explicit and do `return ie` as well
end
test.rb:
ie = browser
if ie.text.include?"There is a problem with this website's security certificate."
  ie.link(:id, 'overridelink').click
end
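The scoping rule described above can be seen without Watir. In this sketch FakeBrowser is a hypothetical stand-in for the IE object, and both method names are made up for illustration: the first method keeps its local variable to itself, the second hands the object back to the caller.

```ruby
# Hypothetical stand-in for a browser object, so the sketch runs without Watir.
FakeBrowser = Struct.new(:url)

# The local variable `ie` disappears when the method returns; only the
# value of the last expression (here, a string) reaches the caller.
def open_without_return
  ie = FakeBrowser.new('some url')
  "opened #{ie.url}"
end

# Ending the method with `ie` returns the object to the caller.
def open_with_return
  ie = FakeBrowser.new('some url')
  ie
end

open_without_return
begin
  ie.url # `ie` was never defined at this level, so this raises NameError
rescue NameError => e
  puts "NameError: #{e.message}"
end

browser = open_with_return
puts browser.url
```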
You should also require the second file, like this:
require_relative 'functions'

Mechanize on HTTPS site

Has anyone used the Mechanize gem on a site that required SSL?
When I try to access such a website Mechanize tries to use standard HTTP which results in endless redirections between http:// and https://.
Mechanize works just fine with HTTPS. Try setting
agent.log = Logger.new(STDOUT)
to see what's going on between Mechanize and the server. If you are still having trouble, post a sample of the code and somebody will help.
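When reading such a log, the thing to look for is a URL that shows up twice in the chain of redirects. A small sketch of that check follows. The redirect table is a made-up stand-in for a server that bounces between http:// and https://, and trace_redirects is a hypothetical helper, not part of Mechanize.

```ruby
# Hypothetical redirect table standing in for a misconfigured server.
LOOPING_REDIRECTS = {
  'http://example.com/login'  => 'https://example.com/login',
  'https://example.com/login' => 'http://example.com/login'
}.freeze

# Follow the table from start_url and report :ok, :loop, or :limit with the chain of URLs.
def trace_redirects(start_url, redirects, limit: 10)
  seen = []
  url = start_url
  while redirects.key?(url)
    seen << url
    url = redirects[url]
    return [:loop, seen + [url]] if seen.include?(url)
    return [:limit, seen] if seen.size >= limit
  end
  [:ok, seen + [url]]
end

status, chain = trace_redirects('http://example.com/login', LOOPING_REDIRECTS)
puts status
puts chain.join(' -> ')
```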
I just gave Mechanize a try with my company's web site. The home page is HTTP, but it contains a link, "customer login," which sends the browser to an HTTPS page. It worked fine. The code is:
#!/usr/bin/ruby1.8
require 'rubygems'
require 'mechanize'
# WWW::Mechanize is the pre-1.0 class name; newer Mechanize versions use Mechanize.new
agent = WWW::Mechanize.new
page = agent.get("http://www.not_the_real_url.com")
# Follow the link that leads to the HTTPS login page
link = page.link_with(:text=>"CUSTOMER LOGIN")
page = link.click
form = page.forms.first
form['user_login'] = 'not my real login name'
form['user_password'] = 'not my real password'
page = form.submit
