I am trying to build a simple crawler that can log in to Pinterest and pin a few things to my board.
The first step is to log in successfully. I read through the documentation and it seems like this should work, but it doesn't.
When I run the code, I expect it to print a title like "Mary... is mary... on Pinterest",
but instead the title of the page is "Pinterest-The Visual Discovery Tool".
I think there's something wrong with my script.
require 'rubygems'
require 'mechanize'
require 'pry'
a = Mechanize.new
a.get('https://www.pinterest.com/login/') do |page|
  form = page.forms.first
  form.fields[0].value = "m...#gmail.com"
  form.fields[1].value = "some_password"
  new_page = form.submit
  puts new_page.title
end
Keep in mind that Mechanize cannot execute JavaScript, so a page that depends on JavaScript may not load correctly. I only skimmed the page source, but the login page looks heavily dependent on JavaScript, so it probably can't be crawled effectively with Mechanize.
Another option is to drive a real browser, which can also run headless, with a tool like Watir or Selenium.
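One way to confirm this diagnosis is to fail loudly when the static HTML contains no form at all, instead of submitting whatever `page.forms.first` returns. The helper below is a minimal sketch; `first_form_or_fail` is a hypothetical name, not part of Mechanize, and it only assumes the page object responds to `forms` and `title`, as a Mechanize page does.

```ruby
# Hypothetical helper, not part of Mechanize. It accepts any object that
# responds to #forms and #title, which a Mechanize::Page does.
def first_form_or_fail(page)
  form = page.forms.first
  return form unless form.nil?

  # No form in the HTML the server sent: the form is probably built later
  # by JavaScript, which Mechanize never runs.
  raise ArgumentError,
        "no <form> in the static HTML of #{page.title.inspect}; " \
        "the page probably builds its form with JavaScript"
end
```

In the script from the question, `form = page.forms.first` would become `form = first_form_or_fail(page)`, so a JavaScript-rendered login page produces a clear error rather than a confusing page title.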
I am trying to scrape my university web site using Ruby and Mechanize. This is my Ruby script:
require 'mechanize'
agent = Mechanize.new
agent.get('https://kampus.izu.edu.tr')
This script doesn't return a response. I expect to see the login page, but the response is different. I also tried it with cURL like this:
curl https://kampus.izu.edu.tr
This works and returns the login page. What am I missing?
Make sure that you are storing the output of agent.get(). From your example, I don't see how you would be using/printing the response of this request.
Try this:
require 'mechanize'
agent = Mechanize.new
page = agent.get("https://kampus.izu.edu.tr")
puts page.body
The .get() method returns a Mechanize::Page object that you can call other methods on, such as .css(), to select elements by CSS selectors. See the Mechanize documentation for the full list.
I am using the Ruby gem Mechanize to automate a file upload after logging in to a particular site.
I am able to log in using
#!/usr/bin/ruby
require 'rubygems'
require 'mechanize'
# Create a Mechanize agent
a = Mechanize.new { |agent|
  # The site refreshes after login
  agent.follow_meta_refresh = true
}
# Fetch the page
a.get('https://www.samplesite.com/') do |page|
  puts page.title
  form = page.forms.first
  form.fields.each { |f| puts f.name }
  form['username'] = "username"
  form['password'] = "password"
  # Then submit the form to reach the logged-in page
  form.submit
end
Now I have two questions:
a. Can I watch this happen in a browser using any agent or tool?
b. Is there any way to make Mechanize wait for the page to load?
Have you tried Selenium WebDriver? It drives a real browser, so you can watch each step as it happens.
It should integrate easily with your Ruby program.
I was looking for some information about Mechanize and found the code below on the Internet:
require 'mechanize'
require 'logger'
agent = Mechanize.new
agent.user_agent_alias = 'Windows IE 9'
agent.follow_meta_refresh = true
agent.log = Logger.new(STDOUT)
Could anyone please explain why user_agent_alias and follow_meta_refresh are needed when Mechanize itself is a browser?
Mechanize isn't a browser. It is a page parser that gives you enough methods to make it easy/convenient to navigate through a site. But, in no way is it a browser.
user_agent_alias sets the signature Mechanize sends in the User-Agent header when it makes page requests. In your example it's trying to spoof a site by masquerading as "IE 9", but that signature alone isn't going to fool a system that checks more than the User-Agent header.
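To make the header concrete, here is a minimal sketch that uses Ruby's standard Net::HTTP instead of Mechanize. It builds a request without sending it and reads the header back. The helper name `request_with_user_agent`, the URL, and the User-Agent string are placeholders chosen for illustration.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a GET request that carries a chosen User-Agent.
# The helper name, URL, and User-Agent string are illustrative placeholders.
def request_with_user_agent(url, user_agent)
  Net::HTTP::Get.new(URI(url), 'User-Agent' => user_agent)
end

req = request_with_user_agent(
  'https://example.com/',
  'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1)'
)
puts req['User-Agent']
```

The header is just a string the client chooses, which is why a server has no way to verify it on its own.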
follow_meta_refresh, well, you should take the time to search for "meta" tags with the "refresh" parameter. It's trivial to find out about it, and, then you'll understand. Or just read the documentation for it.
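For reference, a meta refresh is an HTML tag that tells the client to load another URL after a delay. The sketch below pulls the delay and target out of such a tag with plain Ruby. `meta_refresh_target` is a hypothetical helper, and the regular expression is a rough assumption (http-equiv before content, quoted content value), not how Mechanize parses pages.

```ruby
# Hypothetical helper: extract the delay and target URL from a meta refresh
# tag. The regex is a rough sketch; it assumes http-equiv appears before
# content and that the content value is quoted.
META_REFRESH = /<meta[^>]*http-equiv=["']?refresh["']?[^>]*content=["']\s*(\d+)\s*;\s*url=([^"']+)["']/i

def meta_refresh_target(html)
  match = META_REFRESH.match(html)
  return nil unless match

  [match[1].to_i, match[2].strip]
end

html = '<html><head><meta http-equiv="refresh" content="5; url=https://example.com/next"></head></html>'
p meta_refresh_target(html)
```

A client that ignores this tag stays on the intermediate page, which is the situation follow_meta_refresh is meant to avoid.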
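Here is sample code:
require 'nokogiri'
require 'open-uri'
begin
  # url is assumed to be defined earlier; URI.open replaces the bare open,
  # which no longer accepts URLs in Ruby 3.0 and later
  doc = Nokogiri::HTML(URI.open(url))
rescue
  puts "An error occurred..."
end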
Some parts of the page are loaded asynchronously, and I'm missing some values that are loaded later. Is there any way to open the URL, wait 10 seconds, and then assign the result to the variable doc? Any solutions or ideas using bash/lynx/wget are welcome too :)
Unfortunately, waiting 10 seconds isn't going to work, because neither open-uri nor Nokogiri will execute the JavaScript that loads content asynchronously. You'll need to use a browser driver like Watir or Watir-webdriver. If JRuby is an option, you can use Celerity, a browser emulator that supports some JavaScript (using the Watir API) and will likely perform the task you need.
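Once a browser driver is rendering the page, a common pattern is to poll for the late-loading value instead of sleeping for a fixed 10 seconds. The helper below is a generic sketch in plain Ruby. `wait_until` and its `timeout` and `interval` parameters are hypothetical names, and the driver-specific lookup is left to the block because it depends on the library you choose.

```ruby
# Hypothetical polling helper: call the block until it returns a truthy
# value, or raise once the timeout (in seconds) has elapsed.
def wait_until(timeout: 10, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end

    sleep interval
  end
end
```

The block would contain whatever call your driver offers for reading the element, returning nil or false while the content is still missing. Polling returns as soon as the value appears, so the script waits only as long as it has to.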
I have little experience with Ruby, but I know it's a powerful language, especially for web programming. My question is: how can I write a program that automatically logs in to a website and downloads the daily news feeds, for example, logging in to a forum website and downloading all the threads? Thanks.
For tasks like these that simulate a web browser experience, I use the mechanize gem. It works like this:
require 'rubygems'
require 'mechanize'
www = Mechanize.new
www.get('http://your.site/path/to/login/page') do |login_page|
  inside_page = login_page.form_with(:action => '/path/to/login/form/action') do |f|
    f.form_username_element_name = "username"
    f.form_password_element_name = "password"
  end.click_button
  # Do stuff with "inside_page", like navigate, scrape links, etc...
  # See the mechanize docs for details
end