Previously I used Mechanize for parsing, but now I'm parsing a website that uses JavaScript, which Mechanize doesn't support, so I switched to Selenium. I have to collect information about companies from this website, but the information only appears after clicking a JavaScript link. I did this with Selenium: my parser clicks the JavaScript links and then collects the information, and this is where the problems start. I need to save the collected information to a database, and I can only do that properly if it is stored in separate variables (e.g. address=.., phone=.., email=.., etc.). I pick the necessary elements with SelectorGadget and Selenium collects them (driver.find_element(:css, ..)), but all the information about each company sits under a single selector (.p2 div), so I cannot save the location in one variable, the phone in another, and so on. So my question is: is it possible to split this text and save the parts in separate variables?
Photos that illustrate the process:
i.imgur.com/J5dcGZD.png
i.imgur.com/MaBWICZ.png
i.imgur.com/ZDNXhLt.png
Photo with part of html:
http://i.imgur.com/NUa1X97.png
Here is an example page of this site. The site is in Russian, so translate it with Google Translate.
The parser itself (it saves a block of text from each company to the contacts variable):
require 'rubygems'
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :firefox
driver.get "http://www.ypag.ru/cat/komp249/page3880.html"
loop do
driver.find_elements(:css, ".p2 div a").each {|link| link.click}
driver.find_elements(:css, ".p3 a, .firm , .p2 div").each {
|n,r,c|
name = n
region = r
contacts = c
print name.text.center(100)
puts region
puts contacts
}
link = driver.find_element(:xpath, "/html/body/table[5]/tbody/tr/td/a[2]" )[:href]
break if link == "http://www.ypag.ru/cat/komp249/page3780.html"
driver.get "#{link}"
end
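One possible way to split such a block is to match each line against a pattern per field. This is only a minimal sketch: the labels (Address, Phone, Email), the helper name split_contacts and the sample text are all assumptions for illustration, because the real page is in Russian and its exact wording is not shown here.

```ruby
# Splits one company's text block into a hash of named fields.
# Hypothetical helper; the labels below are assumptions and must be
# adjusted to the wording the site actually uses.
def split_contacts(text)
  patterns = {
    address: /\AAddress:\s*(.+)\z/i,
    phone:   /\APhone:\s*(.+)\z/i,
    email:   /\AE-?mail:\s*(.+)\z/i
  }
  fields = {}
  text.each_line do |line|
    line = line.strip
    patterns.each do |key, pattern|
      match = pattern.match(line)
      fields[key] = match[1] if match
    end
  end
  fields
end

# Made-up sample standing in for contacts.text from the parser above.
sample = "Address: 1 Example Street\nPhone: +7 000 000-00-00\nEmail: info@example.com"
info = split_contacts(sample)
address = info[:address]
phone   = info[:phone]
email   = info[:email]
puts address, phone, email
```

Fields that are missing from a block simply come back as nil, so the database code can decide how to handle incomplete companies.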
Related
I am trying to scrape through the following website :
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.
As already said in the linked answer, the data is loaded asynchronously. This means that the data is not present in the initial page and is loaded afterwards by JavaScript code running in the browser.
When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page. You'll get to see a list of all requests made. If you're looking for asynchronously loaded data the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This seems to be caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"
To handle the BOM correctly, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet using Ruby 2.7, you can strip the BOM manually instead; however, the former is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
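The BOM handling above can be tried without a network call. The following is a minimal sketch, assuming a hypothetical helper name (parse_json_with_bom) and a made-up sample body. It uses set_encoding_by_bom where it is available and otherwise strips the BOM by hand.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper (not part of HTTParty or the json library): parses a
# JSON body that may start with a UTF-8 byte order mark.
def parse_json_with_bom(raw)
  io = StringIO.new(raw.dup)
  text =
    if io.respond_to?(:set_encoding_by_bom)
      io.set_encoding_by_bom # Ruby 2.7+: skips a leading BOM if present
      io.read
    else
      # Older Rubies: treat the bytes as UTF-8 and drop the BOM character
      raw.dup.force_encoding('UTF-8').delete_prefix("\uFEFF")
    end
  JSON.parse(text)
end

# Made-up sample standing in for response.body: BOM bytes followed by JSON.
sample_body = "\xEF\xBB\xBF[{\"Jurisdiction\":\"Alabama\",\"Cases Reported\":10145}]".b
data = parse_json_with_bom(sample_body)
puts data.first['Jurisdiction']
```

Wrapping both variants in one helper keeps the calling code identical regardless of the Ruby version in use.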
That page uses AJAX to load its data.
In that case you can use Watir to fetch the page through a browser,
as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get the data from the API directly.
You can find the other endpoints by checking the Network tab in your browser's developer tools.
I replicated your code and found a few errors that you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around the value of your url variable, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here
Here is the code I used to parse the web page. I ran it in the Rails console, but I am not getting any output there. The site I want to scrape uses lazy loading.
require 'nokogiri'
require 'open-uri'
page = 1
while true
url = "http://www.justdial.com/functions"+"/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits"+"&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
doc = Nokogiri::HTML(open(url))
doc = Nokogiri::HTML(doc.at_css('#ajax').text)
d = doc.css(".rslwrp")
d.each do |t|
puts t.css(".jrcw").text
puts t.css("span.jcn").text
puts t.css(".jaid").text
puts t.css(".estd").text
page+=1
end
end
You have two options here:
1. Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a suitable driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set some timeouts or find another way to make sure the blocks of text you're interested in are loaded before you start scraping.
2. Use the browser's web developer console to figure out how those blocks of text are loaded (which AJAX calls, with which parameters, etc.) and reproduce those calls in your scraper. This is a more advanced approach, but it performs better, since you avoid the extra work that option 1 involves.
Have a nice day!
UPDATE:
Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:
{
"error": 0,
"msg": "request successful",
"paidDocIds": "some ids here",
"itemStartIndex": 20,
"lastPageNum": 50,
"markup": 'LOTS AND LOTS AND LOTS OF MARKUP'
}
What you need to do is unwrap the JSON and then parse the markup as HTML:
require 'json'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
I'd also advise against using open-uri, since your code may become vulnerable if you use dynamic URLs because of the way open-uri works (read the linked article for the details); use a more full-featured library such as HTTParty or RestClient instead.
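The checks hinted at in the comments of the snippet above could look roughly like this. It is a sketch only: the helper name extract_markup and the sample body are made up, and treating a non-zero "error" value as a failure is an assumption based on the field names in the response shown earlier.

```ruby
require 'json'

# Hypothetical helper: pulls the HTML out of the JSON envelope and fails
# loudly when the envelope reports an error or carries no markup.
def extract_markup(body)
  json = JSON.parse(body)
  # Assumption: a non-zero "error" value means the request failed.
  unless json['error'].to_i.zero?
    raise ArgumentError, "server reported an error: #{json['msg']}"
  end
  html = json['markup'].to_s
  raise ArgumentError, 'response contains no markup' if html.strip.empty?
  html
end

# Made-up sample shaped like the response shown above.
sample_body = JSON.generate({
  'error'  => 0,
  'msg'    => 'request successful',
  'markup' => '<div class="rslwrp"><span class="jcn">Example listing</span></div>'
})

html = extract_markup(sample_body)
puts html
# The returned string can then be passed to Nokogiri::HTML(html).
```

Raising early keeps a failed or empty page from silently producing an empty Nokogiri document further down the loop.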
UPDATE 2: Minimal working script for me:
require 'json'
require 'open-uri'
require 'nokogiri'
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi
I've been trying, but I can't get these specific links on this page:
http://www.windowsphone.com/en-us/store/top-free-apps
I want to get each one of the links on the left side of this page (Entertainment, for example), but I can't find the right reference to get them.
Here is the script:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.links_with(???)
What should I put instead of ??? so that I can get those links?
I've tried stuff like:
page.links_with(:class => 'categoryNav navText')
OR
page.links_with(:class => 'categoryNav')
OR
page.links_with(:class => 'navText')
etc
Can anyone help, please?
Using page.parser, you can access the underlying Nokogiri object. This allows you to use XPath for your search.
The idea here is that all those links have a 'data-ov' attribute that starts with 'AppLeftMerch'. This is something we can use to identify them with the 'starts-with' function.
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.parser.xpath("//a[starts-with(@data-ov,'AppLeftMerch')]").each do |link|
puts link[:href]
end
I'm trying to log into Google so that I can scrape and migrate a private Google Group.
It doesn't seem to log in over SSL. Any ideas appreciated. I'm using Mechanize and the code is below:
group_signin_url = "https://login page to google, with referrer url to a private group here"
user = ENV['GOOGLE_USER']
password = ENV['GOOGLE_PASSWORD']
scraper = Mechanize.new
scraper.user_agent = Mechanize::AGENT_ALIASES["Linux Firefox"]
scraper.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
page = scraper.get group_signin_url
google_form = page.form
google_form.Email = user
google_form.Passwd = password
group_page = scraper.submit(google_form, google_form.buttons.first)
pp group_page
I worked with Ian (the OP) on this problem and just felt we should close this thread with some answers based on what we found when we spent some more time on the problem.
1) You can't scrape a Google Group with Mechanize. We managed to get logged in, but the content of the Google Group pages is all rendered in-browser, meaning that HTTP requests, such as those issued by Mechanize, are returned with a few links and no actual content.
We found that we could get page content by the use of Selenium (we used Selenium in Firefox, using the Ruby bindings).
2) The HTML element IDs/classes in Google Groups are obfuscated, but we found that these Selenium commands will pull out the bits you need (until Google changes them):
message snippets (click on them to expand messages)
find_elements(:class, 'GFP-UI5CCLB')
elements with name of author
find_elements(:class, 'GFP-UI5CA1B')
elements with content of post
find_elements(:class, 'GFP-UI5CCKB')
elements containing date
find_elements(:class, 'GFP-UI5CDKB') (and then use the attribute[:title] for a full length date string)
3) I have some Ruby code here which scrapes the content programmatically and uploads it into a Discourse forum (which is what we were trying to migrate to).
It's hacky but it kind of works. I recently migrated 2 commercially important Google Groups using this script. I'm up for taking on 'We Scrape Your Google Group' type work, please PM me.
I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))
twitter_accounts = []
shares = doc.xpath('//*[@id="tweets"]/li')
shares.map do |tweet|
twitter_accounts << tweet.at_css('.conv.tweet.a')
end
puts twitter_accounts
My understanding is that Nokogiri will save shares in some form of tree structure, which I can use to drill down into, but my mileage is varying.
That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:
require 'json'
url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
hash = JSON.parse open(url).read
puts hash['results'].map{|x| x['from_user']}
I got that URL by loading the page in Chrome and then looking at the network panel. I also removed the timestamp and callback parameters just to clean things up a bit.
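To see what the mapping does without hitting the network, here is a small sketch on a made-up payload. Only the "results" and "from_user" keys come from the snippet above; everything else in the sample is invented for illustration.

```ruby
require 'json'

# Made-up payload; only the "results" / "from_user" keys are taken from
# the answer above.
sample = <<~JSON_TEXT
  {"results": [
    {"from_user": "alice", "text": "first tweet"},
    {"from_user": "bob",   "text": "second tweet"},
    {"from_user": "alice", "text": "third tweet"}
  ]}
JSON_TEXT

hash = JSON.parse(sample)
# fetch with a default so a response without "results" yields an empty list;
# compact and uniq drop missing names and repeated accounts
usernames = hash.fetch('results', []).map { |tweet| tweet['from_user'] }.compact.uniq
puts usernames
```

Using fetch with a default avoids a NoMethodError on nil when the response has no "results" key at all.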
Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be looks like:
<h2>Tweets</h2>
<ul id="tweets"></ul>
</div>
This is likely because they're generated by some JavaScript call which Nokogiri isn't executing. One possible solution is to use Watir to navigate to the page, let the JavaScript run, and then save the HTML.
Here is a script that accomplishes just that. Note that you had some issues with your XPath arguments which I've since solved, and that watir will open a new browser every time you run this script:
require 'watir'
require 'nokogiri'
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
twitter_accounts = []
shares = doc.xpath('//li[contains(@class, "tweet")]/a')
shares.each do |tweet|
twitter_accounts << tweet.attr('title')
end
puts twitter_accounts
browser.close
You can also use headless to prevent a window from opening.