Use ruby mechanize to get data from foursquare

I am trying to use ruby and Mechanize to parse data on foursquare's website. Here is my code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]
form.F12778070592981DXGWJ = ARGV[0]
form.F1277807059296KSFTWQ = ARGV[1]
page = form.submit form.buttons.first
puts page.body
But when I run this code, the following error popped up:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/mechanize-2.0.1/lib/mechanize/form.rb:162:in
`method_missing': undefined method `F12778070592981DXGWJ='
for #<Mechanize::Form:0x2b31f70> (NoMethodError)
from four.rb:10:in `<main>'
I checked and found that these two form field names, "F12778070592981DXGWJ" and "F1277807059296KSFTWQ", change every time I open foursquare's login page.
Has anyone run into this problem before, where the form field names change on every request? How should I solve this?
Our project is about parsing data on foursquare, so I need to be able to log in first.

Mechanize is useful for sites which don't expose an API, but Foursquare has an established REST API already. I'd recommend using one of the Ruby libraries, perhaps foursquare2. These libraries abstract away things like authentication, so you just have to register your app and use the provided keys.
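For example, a minimal sketch with the foursquare2 gem might look like this (the credentials are placeholders, and the exact response shape follows the Foursquare v2 API, so it may differ):
require 'foursquare2'
# Placeholder credentials -- register your app with Foursquare to get real ones.
client = Foursquare2::Client.new(
  :client_id     => 'YOUR_CLIENT_ID',
  :client_secret => 'YOUR_CLIENT_SECRET'
)
# Search venues near a lat/lng -- no login form or HTML parsing involved.
results = client.search_venues(:ll => '40.7,-74', :query => 'coffee')
puts results.inspect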

Instead of indexing the form fields by their name, just index them by their order. That way you don't have to worry about the name that changes on each request:
form.fields[0].value = ARGV[0]
form.fields[1].value = ARGV[1]
...
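Put together with the original script, the whole login flow might look like this (still a sketch; the form index and the order of the fields are assumptions based on what the page served at the time):
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]            # assumes the login form is still the second form on the page
form.fields[0].value = ARGV[0]  # username field, whatever its generated name happens to be
form.fields[1].value = ARGV[1]  # password field
page = form.submit form.buttons.first
puts page.body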
However, like dwhalen said, using the REST API is probably a much better way. That's why it's there.

Related

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape the following website:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.
Like already said in the linked answer the data is loaded asynchronously. This means that the data is not present on the initial page and is loaded through the JavaScript engine executing code.
When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page. You'll get to see a list of all requests made. If you're looking for asynchronously loaded data the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This seems to be a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> "\uFEFF" (a zero-width character, so it can look like an empty string in a terminal)
format '%X', response.body[0].ord
#=> "FEFF"
To correctly handle the BOM, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet on Ruby 2.7 you can strip the BOM yourself instead, though set_encoding_by_bom is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
That page is using AJAX to load its data.
In that case you may use Watir to fetch the page with a real browser, as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get the data from the API directly.
You can see the other endpoints by checking the "Network" tab in your browser's developer console.
I replicated your code and found some errors you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around your url variable's value, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

How to scrape pages which have lazy loading

Here is the code I used for parsing the web page. I ran it in the Rails console, but I am not getting any output. The site I want to scrape uses lazy loading.
require 'nokogiri'
require 'open-uri'
page = 1
while true
  url = "http://www.justdial.com/functions"+"/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits"+"&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  d = doc.css(".rslwrp")
  d.each do |t|
    puts t.css(".jrcw").text
    puts t.css("span.jcn").text
    puts t.css(".jaid").text
    puts t.css(".estd").text
    page+=1
  end
end
You have 2 options here:
Switch from pure HTTP scraping to a tool which supports JavaScript evaluation, such as Capybara (with a proper driver selected). This can be slow, since you're running a headless browser under the hood, plus you'll have to set timeouts or figure out another way to make sure the blocks of text you're interested in are loaded before you start scraping.
The second option is to use the Web Developer console to figure out how those blocks of text are loaded (which AJAX calls, their parameters, etc.) and implement those calls in your scraper. This is a more advanced approach, but more performant, since you won't do any of the extra work option 1 requires.
Have a nice day!
UPDATE:
Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:
{
  "error": 0,
  "msg": "request successful",
  "paidDocIds": "some ids here",
  "itemStartIndex": 20,
  "lastPageNum": 50,
  "markup": "LOTS AND LOTS AND LOTS OF MARKUP"
}
What you need is to unwrap the JSON and then parse the markup as HTML:
require 'json'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
I'd also advise against using open-uri, since your code may become vulnerable if you use dynamic URLs because of the way open-uri works (read the linked article for details); prefer more featureful libraries such as HTTParty and RestClient.
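For instance, the same fetch with HTTParty would be roughly this (a sketch; error handling omitted):
require 'httparty'
require 'json'
require 'nokogiri'
response = HTTParty.get(url)        # same url as in the snippet above
json = JSON.parse(response.body)    # check response.code and json['error'] in real code
doc = Nokogiri::HTML(json['markup'])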
UPDATE 2: Minimal working script for me:
require 'json'
require 'open-uri'
require 'nokogiri'
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi
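Since the JSON response also carries lastPageNum (see the sample above), and since the question's loop increments page inside the d.each block (once per result instead of once per page), a corrected pagination sketch could look like this (the meaning of lastPageNum and error is assumed from the sample response):
require 'json'
require 'open-uri'
require 'nokogiri'
base = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page='
page = 1
loop do
  json = JSON.parse(open("#{base}#{page}").read)   # check HTTP errors in real code
  break if json['error'] != 0 || json['markup'].to_s.empty?
  doc = Nokogiri::HTML(json['markup'])
  doc.css('.rslwrp').each { |t| puts t.css('span.jcn').text }
  break if page >= json['lastPageNum'].to_i        # assumed to be the total page count
  page += 1                                        # increment once per page, not per result
end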

Manipulate webpage and have it send back results using ruby

So here it is:
I use Ruby to get user input the easy way. Say I request 2 inputs:
input1 = gets.chomp
input2 = gets.chomp
Now I would like to send this information to, say, a search engine that takes these two options separately and does the search. How can I do this? What APIs/gems would be helpful in this case?
I know that I can take these 2 inputs and insert them into the URL, but it's not that simple, because the URL structure changes depending on the inputs. (I wouldn't want to do it that way anyway.)
It's been a long time since I last programmed in Ruby. I know how to access web pages and things like that, but I want to manipulate a page and receive results back. Any ideas?
If you are talking about the front end of a site without any API access or sophisticated JS logic, you could simply use the mechanize gem, which allows you to do something like:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com/')
form = page.forms.first
form['field_name_1'] = input1
form['field_name_2'] = input2
page = agent.submit(form, form.buttons.first)
puts page
→ Check out the official documentation for more examples
If you are going to use third party REST API you should better try something like faraday or other popular gems (depending on your taste and particular task).
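For example, a minimal Faraday sketch, assuming a hypothetical search endpoint that takes the two inputs as query parameters:
require 'faraday'
conn = Faraday.new(:url => 'https://api.example.com')       # hypothetical API host
response = conn.get('/search', :q => input1, :in => input2) # hypothetical endpoint and params
puts response.body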
Please correct me if I misunderstood you.
From what I understand, you want to encode your two inputs in a URL, send them to an API and receive the results back.
You can use the Net::HTTP library from the Ruby stdlib. Here's the example with dynamic parameters from the docs:
uri = URI('http://example.com/index.html')
params = { :limit => 10, :page => 3 }
uri.query = URI.encode_www_form(params)
res = Net::HTTP.get_response(uri)
puts res.body if res.is_a?(Net::HTTPSuccess)
Or you can use a gem to wrap it up for you. HTTParty seems quite popular. You can do it as simply as:
HTTParty.get('http://foo.com/resource.json', query: {limit: 10})
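HTTParty also parses JSON responses for you, so getting the results back into Ruby objects is straightforward (a sketch, reusing the placeholder URL from above):
require 'httparty'
response = HTTParty.get('http://foo.com/resource.json', query: { limit: 10 })
data = response.parsed_response  # a Hash or Array when the response is JSON
puts data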

how to extract these specific links using mechanize in Ruby?

I've been trying, but I can't get these specific links on this page:
http://www.windowsphone.com/en-us/store/top-free-apps
I want to get each one of the links on the left side of this page, entertainment for example, but I can't find the right reference to get them.
Here's the script:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.links_with(???)
What should I put instead of ??? so that I can get those links?
I've tried stuff like:
page.links_with(:class => 'categoryNav navText')
OR
page.links_with(:class => 'categoryNav')
OR
page.links_with(:class => 'navText')
etc
Can anyone help, please?
Using page.parser, you can access the underlying Nokogiri object. This allows you to use XPath for your search.
The idea here is that all those links have a 'data-ov' attribute that starts with 'AppLeftMerch', which we can use to identify them via XPath's starts-with() function.
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.parser.xpath("//a[starts-with(@data-ov,'AppLeftMerch')]").each do |link|
puts link[:href]
end
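The same match can also be written as a CSS attribute selector, where ^= means "starts with" (an equivalent sketch):
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.parser.css("a[data-ov^='AppLeftMerch']").each do |link|
  puts link[:href]
end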

Ruby Mechanize login not working

Let me set the stage for what I'm trying to accomplish. In a physics class I'm taking, my teacher always likes to brag about how impossible it is to cheat in her class, because all of her assignments are done through WebAssign. The way WebAssign works is this: Everyone gets the same questions, but the numbers used in the question are random variables, so each student has different numbers, thus a different answer. So I've been writing ruby scripts to solve the questions for people by just inputting your specific numbers.
I would like to automate this process using mechanize. I've used mechanize plenty of times before, but I'm having trouble logging in to the site. I'll submit the form and it returns the same page I was just on. You can take a look at the site's source code, at http://webassign.net, and I've also tried using the login at http://webassign.net/login.html with no luck either.
Let me follow all of this up with some ruby code that doesn't do what I want it to:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.webassign.net/login.html")
form = page.forms.last
puts "Enter your username"
form.WebAssignUsername = gets.chomp
puts "Enter your password (Don't worry, we don't save this)"
form.WebAssignPassword = gets.chomp
form.WebAssignInstitution = "trinityvalley.tx"
form.submit #=> Returns original page
If anyone really takes an interest in getting this to work, I would be more than happy to send them a working username and password.
The site could be checking that the Login post variable is set (see the login button). Try adding form.Login = "Login".
Have you tried to use agent.submit(form, form.buttons.first) instead of form.submit?
This worked for me when I tried to submit a form. I tried using form.submit first and it kept returning the original page.
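In the script above, that change is just the submit line:
page = agent.submit(form, form.buttons.first)  # instead of form.submit
puts page.body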
Try setting the user agent:
agent = Mechanize.new do |a|
a.user_agent_alias = 'Mac Safari'
end
Some sites seem to require that.
Your question is a little ambiguous; what is the problem exactly? Are you getting an entirely different response than when you view the page in a browser? If so, then do what #cam says and analyze the headers; you can do that in Firefox via an extension, or in Chrome natively. Either way, try to mimic the headers you see in the browser in your Mechanize agent. Here is a script that I used to mimic the iTunes request headers when I was data-mining the App Store:
def mimic_itunes( mech_agent )
  mech_agent.pre_connect_hooks << lambda { |headers|
    headers[:request]['X-Apple-Store-Front'] = X_APPLE_STOREFRONT
    headers[:request]['X-Apple-Tz'] = X_APPLE_TZ
    headers[:request]['X-Apple-Validation'] = X_APPLE_VALIDATION
  }
  mech_agent.user_agent = 'iTunes/9.1.1 (Windows; Microsoft Windows 7 x64 Business Edition (Build 7600)) AppleWebKit/531.22.7'
  mech_agent
end
Note: the constants in the example are just strings... not really that important what they are, as long as you know you can add any string there
Using this approach, you should be able to alter/add any headers that the web application might need.
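Usage would then be along these lines (the URL is a placeholder, and the three constants need real string values):
agent = mimic_itunes(Mechanize.new)
page = agent.get('http://example.com/')  # requests now carry the mimicked headers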
If this is not the problem that you are having, then post more in-depth details of what exactly is happening.
