Maintaining cookies between Mechanize requests - ruby

I'm trying to use the Ruby version of Mechanize to extract my employer's tickets from a ticket management system that we're moving away from that does not supply an API.
Problem is, it seems Mechanize isn't keeping the cookies between the post call and the get call shown below:
require 'rubygems'
require 'nokogiri'
require 'mechanize'
agent = Mechanize.new
page = agent.post('http://<url>.com/user_session', {
  'authenticity_token' => '<token>',
  'user_session[login]' => '<login>',
  'user_session[password]' => '<password>',
  'user_session[remember_me]' => '0',
  'commit' => 'Login'
})
page = agent.get 'http://<url>.com/<organization>/<repo-name>/tickets/1'
puts page.title
user_session is the URL to which the site's login page POSTs, and I've verified that this indeed logs me in. But the page that returns from the get call is the 'Oops, you're not logged in!' page.
I've verified that clicking links on the page that returns from the post call works, but I can't actually get to where I need to go without JavaScript. And of course I've done this successfully on the browser with the same login.
What am I doing wrong?

Okay, this might help you. First of all, what version of Mechanize are you using? You need to identify whether this problem is due to the cookies being overwritten/cleared by Mechanize between the requests, or whether the cookies are wrong or not being set in the first place. You can do that by adding a puts agent.cookie_jar.jar between the two requests to see what is stored.
If it's an overwriting issue, you might be able to solve it by collecting the cookies from the first request and applying them to the second. There are several ways to do this:
One way is to grab temp_jar = agent.cookie_jar.jar and then go through each cookie, adding it again using the .add method.
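For illustration, here is a rough sketch of that manual hand-over, assuming a reasonably recent Mechanize whose cookie jar is enumerable and whose add method accepts a cookie object (older versions expose the raw hash via agent.cookie_jar.jar instead):
require 'mechanize'
agent = Mechanize.new
# ... log in with this agent first, as in the question ...
# Collect the cookies from the authenticated session.
old_cookies = agent.cookie_jar.to_a
# Re-add them to a fresh agent so its next request is authenticated.
fresh_agent = Mechanize.new
old_cookies.each { |cookie| fresh_agent.cookie_jar.add(cookie) }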
HOWEVER, the easiest way is to install the latest 2.1 pre-release of Mechanize (it contains many fixes), because you will then be able to do this very simply.
To install the latest, run gem install mechanize --pre and make sure to get rid of the old version with gem uninstall mechanize 'some_version'. After this, you can simply do as follows:
require 'rubygems'
require 'nokogiri'
require 'mechanize'
agent = Mechanize.new
page = agent.post('http://<url>.com/user_session', {
  'authenticity_token' => '<token>',
  'user_session[login]' => '<login>',
  'user_session[password]' => '<password>',
  'user_session[remember_me]' => '0',
  'commit' => 'Login'
})
temp_jar = agent.cookie_jar
# Do whatever you need, then reuse the cookies in a new session afterwards
agent = Mechanize.new
agent.cookie_jar = temp_jar
page = agent.get 'http://<url>.com/<organization>/<repo-name>/tickets/1'
puts page.title
BTW the documentation is here http://mechanize.rubyforge.org/index.html

Mechanize automatically sends cookies obtained from a response in subsequent requests. You can keep using the same agent without creating a new one.
require 'mechanize'
agent = Mechanize.new
agent.post(create_sessions_url, params, headers)
agent.get(ticket_url)
Tested with mechanize 2.7.6.

Related

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape the following website:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although the answer of Layon Ferreira already states the problem, it does not provide the steps needed to load the data.
As already said in the linked answer, the data is loaded asynchronously. This means the data is not present in the initial page and is loaded by the JavaScript engine executing code.
Open the browser development tools and go to the "Network" tab. Clear out all requests, then refresh the page; you'll see a list of all the requests made. If you're looking for asynchronously loaded data, the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This is caused by a byte order mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"
To handle the BOM correctly, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet on Ruby 2.7 you can strip the BOM with a substitution instead, although set_encoding_by_bom is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
That page is using AJAX to load its data.
In that case you may use Watir to fetch the page with a real browser,
as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get the data from the API directly.
You can find the other endpoints by checking the Network tab in your browser's developer tools.
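For example, a minimal sketch that fetches the JSON endpoint mentioned in the answer above, using only the standard library (net/http and json), might look like this:
require 'net/http'
require 'json'
uri = URI('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = Net::HTTP.get(uri)
# Strip a possible UTF-8 byte order mark before parsing (see the BOM discussion above).
data = JSON.parse(body.force_encoding('UTF-8').delete_prefix("\uFEFF"))
puts data.first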
I replicated your code and found some errors that you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around your url value, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

How to extract these specific links using Mechanize in Ruby?

I've been trying, but I can't get these specific links on this page:
http://www.windowsphone.com/en-us/store/top-free-apps
I want to get each one of the links on the left side of this page, entertainment for example, but I can't find the right reference to get them.
Here's the script:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.links_with(???)
What should I put instead of ??? so that I can get those links?
I've tried stuff like:
page.links_with(:class => 'categoryNav navText')
OR
page.links_with(:class => 'categoryNav')
OR
page.links_with(:class => 'navText')
etc.
Can anyone help, please?
Using page.parser, you can access the underlying Nokogiri object. This allows you to use XPath for your search.
The idea here is that all those links have a 'data-ov' attribute that starts with 'AppLeftMerch', which we can use to identify them via the 'starts-with' function.
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.parser.xpath("//a[starts-with(@data-ov,'AppLeftMerch')]").each do |link|
  puts link[:href]
end
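If you prefer CSS selectors, a minimal alternative sketch (assuming the same data-ov attributes) goes through page.search, which delegates to the underlying Nokogiri document:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
# Select anchors whose data-ov attribute starts with "AppLeftMerch".
page.search('a[data-ov^="AppLeftMerch"]').each do |link|
  puts link['href']
end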

How do I use Nokogiri to parse a bit.ly stats page?

I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))
twitter_accounts = []
shares = doc.xpath('//*[@id="tweets"]/li')
shares.map do |tweet|
  twitter_accounts << tweet.at_css('.conv.tweet.a')
end
puts twitter_accounts
My understanding is that Nokogiri will save shares in some form of tree structure, which I can use to drill down into, but my mileage is varying.
That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:
require 'json'
require 'open-uri'
url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
hash = JSON.parse open(url).read
puts hash['results'].map{|x| x['from_user']}
I got that URL by loading the page in Chrome and looking at the network panel; I also removed the timestamp and callback parameters just to clean things up a bit.
Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be looks like:
<h2>Tweets</h2>
<ul id="tweets"></ul>
</div>
This is likely because they're generated by a JavaScript call which Nokogiri isn't executing. One possible solution is to use Watir to navigate to the page, let the JavaScript run, and then grab the resulting HTML.
Here is a script that accomplishes just that. Note that you had some issues with your XPath arguments, which I've fixed, and that Watir will open a new browser every time you run this script:
require 'watir'
require 'nokogiri'
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
twitter_accounts = []
shares = doc.xpath('//li[contains(@class, "tweet")]/a')
shares.each do |tweet|
  twitter_accounts << tweet.attr('title')
end
puts twitter_accounts
browser.close
You can also use the headless gem to prevent a browser window from opening.
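A rough sketch of that, assuming the headless gem (an Xvfb wrapper) is installed and Xvfb is available on the machine:
require 'headless'
require 'watir'
require 'nokogiri'
# Start a virtual display so no browser window pops up.
headless = Headless.new
headless.start
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
browser.close
headless.destroy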

How do I scrape data from a page that loads specific data after the main page load?

I have been using Ruby and Nokogiri to pull data from a URL similar to this one from the hollister website: http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358
My script looks like this right now:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358"))
puts page.css("h3[data-property=GLB_ORDERNUMBERSYMBOL]")[0].text
My problem is that the Hollister page loads some of its data asynchronously, so when my script looks for the page element holding the order-specific data, it doesn't exist yet. I.e., the <h3> with data-property=GLB_ORDERNUMBERSYMBOL doesn't exist yet, but in the browser, if you let the page load for another ten seconds, the DOM and HTML change to reflect the specific order details.
What is the best way to capture this data that loads after the fact? I have tried using watir-webdriver, but I'm not sure what I would need to do to make that work either.
Try installing capybara-webkit (make sure you have QtWebKit installed, otherwise the gem install will fail). This will give you a headless solution. Then try this:
require 'capybara-webkit'
require 'capybara/dsl'
require 'nokogiri'
require 'open-uri'
url = 'http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358'
# Change the Capybara config to use the DSL and the webkit driver
include Capybara::DSL
Capybara.current_driver = :webkit
visit(url)
doc = Nokogiri::HTML.parse(body)
Then parse the body as you normally would. To remove all those error messages, try this:
Capybara.register_driver :webkit do |app|
  Capybara::Driver::Webkit.new(app, :stdout => nil)
end
I am not sure how to do it with Open-URI, but if you want to use Watir-Webdriver, the following works.
require 'watir-webdriver'
b = Watir::Browser.new
b.goto('http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358')
puts b.h3(:class, 'order-num').when_present.text
Note that when_present is called on the h3 element. This means the script will wait for the h3 to appear before trying to get its text. If you know there are parts that take time to load, adding an explicit wait usually solves the problem.
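For example, a rough sketch of such an explicit wait, using Watir::Wait against the same 'order-num' heading as above:
require 'watir-webdriver'
b = Watir::Browser.new
b.goto('http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358')
# Block until the order heading has been rendered by the page's JavaScript.
Watir::Wait.until { b.h3(:class, 'order-num').present? }
puts b.h3(:class, 'order-num').text
b.close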
Following benaneesh's answer, I had to make slight modifications to get it to work in my Ruby script and not show the unknown URL messages...
require 'capybara-webkit'
require 'capybara/dsl'
require 'nokogiri'
require 'open-uri'
include Capybara::DSL
Capybara.current_driver = :webkit
Capybara::Webkit.configure do |config|
  config.block_unknown_urls
  config.allow_url("*mysite.com")
end
#... rest of code

Use ruby mechanize to get data from foursquare

I am trying to use Ruby and Mechanize to parse data on Foursquare's website. Here is my code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]
form.F12778070592981DXGWJ = ARGV[0]
form.F1277807059296KSFTWQ = ARGV[1]
page = form.submit form.buttons.first
puts page.body
But then, when I run this code, the following error popped up:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/mechanize-2.0.1/lib/mechanize/form.rb:162:in
`method_missing': undefined method `F12778070592981DXGWJ='
for #<Mechanize::Form:0x2b31f70> (NoMethodError)
from four.rb:10:in `<main>'
I checked and found that these two form field names, "F12778070592981DXGWJ" and "F1277807059296KSFTWQ", change every time I open Foursquare's page.
Has anyone had this problem before, where the field names change every time you open the page? How should I solve it?
Our project is about parsing the data on Foursquare, so I need to be able to log in first.
Mechanize is useful for sites which don't expose an API, but Foursquare has an established REST API already. I'd recommend using one of the Ruby libraries, perhaps foursquare2. These libraries abstract away things like authentication, so you just have to register your app and use the provided keys.
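For illustration, a rough sketch of what that could look like with the foursquare2 gem; the initialization options and the search_venues call below follow that gem's documented usage, but treat the exact parameters as assumptions:
require 'foursquare2'
# The keys come from registering your app with Foursquare; no form scraping involved.
client = Foursquare2::Client.new(:client_id => 'YOUR_CLIENT_ID',
                                 :client_secret => 'YOUR_CLIENT_SECRET')
# Example: search for venues near a lat/long instead of scraping the website.
venues = client.search_venues(:ll => '40.7,-74', :query => 'coffee')
puts venues.inspect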
Instead of indexing the form fields by their name, just index them by their order. That way you don't have to worry about the names changing on each request:
form.fields[0].value = ARGV[0]
form.fields[1].value = ARGV[1]
...
However, like dwhalen said, using the REST API is probably a much better way. That's why it's there.
