Ruby: extract a link from inside another link - ruby

I have a link like the one below. It is generated by some machine.
link = 'https://bt.sandal.com/promo/v1/clicks/8a-xgVY2gmUE6AUd6AyR6AJDUMVj9RzNrc1i6sJDUSC5rfB7q3YXUstObm-7q3OBUsthosJpHAJO6_yabm-pHOYDQfri6i-B812kgJxGgBBXZSgjH7NDZ325q1OAZ9o-Q1dFyfFN8B29zSBgHMP2_fB-oJhk3_u6uVjh_32VH7OEqRxo8jJF_9B-P7B2PfBiQ_BO3_-o8V2W33BHe72fyfODQMV9o3gqzOgR3A-Q_BNyuPjrc-D692xzpBR3A-Dq7BkQfBoe7BpZ3NcHu2yZsuyHO-JzSoiHfedgjx6Hc-y8AxizJNM_32-HjN_Z9o-QjNkysoGQVKaZSBiHfzE3Bo-QjNkysoGQVKp_Mhg3J2ky1o-ojBk_9x6q_zN_uzS81OEu9Boqjjp_BzC8jBXHA7ibm-SzBu5_uVGgcPirJ1a11UfrcxGH3UNQfBo6_x-oJPmQiUDUMVDgaUEUSo7QuYDHOYDHBYDHZUDUMNOQ3-BrBY5gBYxgIHi6sUFbm-XP3Oig9-wy3zp9R-BrZUEHsnDUMoxPVY2gIHi6BDfHZF7H_yDHsUfou7DUMNwyfVXgcBjy9zB9fVjraUEH_nabm-N9RoOgfPBrRzwy9z7rMBiP9zBUs2QUsUaos1ibmUOH_UibmU7Hs1ibmUaopyNUiFiHsUfHZ-Pwe?r=https%3A%2F%2Fwww.sandal.com%2Fbaru-1%2Fsandal-new-shawllow%3Fsrc%3Dtopsearchs&src=search&is_search=true'
It looks messy, I know. I don't know what this type of link is called, but inside it there is actually a link like this:
https%3A%2F%2Fwww.sandal.com%2Fbaru-1%2Fsandal-new-shawllow
How do I extract the link I want from the first link in Ruby?
Thanks, guys.

You can use a combination of URI.parse and URI.decode_www_form to extract the link from the query parameters:
require 'uri'
parsed_link = URI.parse(link)
query_params = URI.decode_www_form(parsed_link.query).to_h
query_params['r'] # => "https://www.sandal.com/baru-1/sandal-new-shawllow?src=topsearchs"
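As a self-contained sketch, with the long tracking token shortened to a placeholder (TOKEN):

```ruby
require 'uri'

# Same structure as the generated link, tracking token replaced by a placeholder
link = 'https://bt.sandal.com/promo/v1/clicks/TOKEN' \
       '?r=https%3A%2F%2Fwww.sandal.com%2Fbaru-1%2Fsandal-new-shawllow%3Fsrc%3Dtopsearchs' \
       '&src=search&is_search=true'

# Parse the URL, then decode its query string into a hash
query = URI.decode_www_form(URI.parse(link).query).to_h
query['r'] # => "https://www.sandal.com/baru-1/sandal-new-shawllow?src=topsearchs"
```

URI.decode_www_form handles the percent-decoding for you, so the embedded %3A/%2F sequences come back as ":" and "/".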

Related

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape the following website:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(URI.open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get at the collapsed data/gridcell data.
I have tried searching by the aria-label attribute and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.
As the linked answer says, the data is loaded asynchronously: it is not present in the initial page, but is fetched later by the JavaScript engine executing code.
When you open the browser development tools, go to the "Network" tab. Clear out all requests, then refresh the page; you'll see a list of every request made. If you're looking for asynchronously loaded data, the most interesting requests are usually those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This is caused by a Byte Order Mark (BOM) that HTTParty does not remove, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> "\uFEFF"
format '%X', response.body[0].ord
#=> "FEFF"
To handle the BOM correctly, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet on Ruby 2.7 you can strip the BOM manually instead, although set_encoding_by_bom is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\uFEFF/, ''))
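You can see the BOM handling in isolation, without the network call, by simulating the response body (the JSON payload below is a made-up stand-in for the CDC data):

```ruby
require 'json'
require 'stringio'

# Simulated HTTP response body: a UTF-8 BOM followed by JSON, as raw bytes
raw = "\xEF\xBB\xBF[{\"Jurisdiction\":\"Alabama\"}]".b

# JSON.parse(raw) would raise JSON::ParserError because of the leading BOM.
body = StringIO.new(raw)
body.set_encoding_by_bom   # consumes the BOM and sets the encoding to UTF-8
data = JSON.parse(body.read)
data.first['Jurisdiction'] # => "Alabama"
```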
That page is using AJAX to load its data.
In that case, you may use Watir to fetch the page using a browser,
as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get the data from the API directly.
You can see the other endpoints by checking the Network tab in your browser's developer console.
I replicated your code and found some of the errors you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around your url value, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it worked fine for me.
Also, if you're trying to get Covid-19 data, you might want to use these APIs:
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

How to check that a PDF file has some link with Ruby/Rspec?

I am using prawnpdf/pdf-inspector to test that the content of a PDF generated in my Rails app is correct.
I want to check that the PDF file contains a link with a certain URL. I looked at yob/pdf-reader but haven't found any useful information related to this topic.
Is it possible to test URLs within a PDF with Ruby/RSpec?
I would want something like the following:
expect(urls_in_pdf(pdf)).to include 'https://example.com/users/1'
The pdf-reader gem (https://github.com/yob/pdf-reader) provides a text method for each page.
Do something like:
pdf = PDF::Reader.new("tmp/pdf.pdf")
assert pdf.pages[0].text.include? 'https://example.com/users/1'
assuming what you are looking for is on the first page.
Since pdf-inspector seems to return only text, you could try using pdf-reader directly (pdf-inspector uses it anyway):
reader = PDF::Reader.new("somefile.pdf")
reader.pages.each do |page|
puts page.raw_content # This should also give you the link
end
Anyway, I only took a quick look at the GitHub page, so I am not sure what exactly raw_content returns. But there is also a low-level method to directly access the objects of the PDF:
reader = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
With that it is certainly possible to get the URL.

Manipulate webpage and have it send back results using ruby

So here it is:
I use Ruby to get user input the easy way; say I request two inputs:
input1 = gets.chomp
input2 = gets.chomp
Now I would like to send this information to, say, a search engine that takes these two options separately and does the search. How can I do this? What APIs/gems would be helpful in this case?
I know that I can take these two inputs and insert them into the URL, but it's not that simple because the URL structure varies depending on the inputs (and I wouldn't want to use that approach anyway).
It's been a long time since I last programmed in Ruby. I know how to access webpages and things like that, but I want to manipulate a page and receive results back. Any ideas?
If you are talking about the front end of a site without any API access or sophisticated JS logic, you could simply use the mechanize gem, which allows you to do something like this:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com/')
form = page.forms.first
form['field_name_1'] = input1
form['field_name_2'] = input2
page = agent.submit(form, form.buttons.first)
puts page
→ Check out the official documentation for more examples
If you are going to use a third-party REST API, you might be better off with something like faraday or another popular gem (depending on your taste and the particular task).
Please correct me if I misunderstood you.
From what I understand, you want to encode your two inputs in a URL, send them to an API, and receive the results back.
You can use the Net::HTTP library from the Ruby stdlib. Here's the example with dynamic parameters from the docs:
uri = URI('http://example.com/index.html')
params = { :limit => 10, :page => 3 }
uri.query = URI.encode_www_form(params)
res = Net::HTTP.get_response(uri)
puts res.body if res.is_a?(Net::HTTPSuccess)
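For reference, URI.encode_www_form just produces a standard query string; the parameter values here are arbitrary:

```ruby
require 'uri'

# Spaces become '+', and keys/values are joined with '=' and '&'
URI.encode_www_form(limit: 10, page: 3, q: 'ruby http client')
# => "limit=10&page=3&q=ruby+http+client"
```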
Or you can use a gem to wrap it up for you. HTTParty seems quite popular; it can be as simple as:
HTTParty.get('http://foo.com/resource.json', query: {limit: 10})

How to extract these specific links using Mechanize in Ruby?

I've been trying, but I can't get these specific links on this page:
http://www.windowsphone.com/en-us/store/top-free-apps
I want to get each of the links on the left side of that page, Entertainment for example, but I can't find the right reference to get them.
Here's the script:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.links_with(???)
What should I put instead of ??? so that I can get those links?
I've tried things like:
page.links_with(:class => 'categoryNav navText')
OR
page.links_with(:class => 'categoryNav')
OR
page.links_with(:class => 'navText')
etc.
Can anyone help, please?
Using page.parser you can access the underlying Nokogiri object, which allows you to use XPath in your search.
The idea here is that all those links have a data-ov attribute that starts with 'AppLeftMerch', which is something we can use to identify them via the starts-with function.
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.windowsphone.com/en-us/store/top-free-apps")
page.parser.xpath("//a[starts-with(@data-ov,'AppLeftMerch')]").each do |link|
  puts link[:href]
end

Use ruby mechanize to get data from foursquare

I am trying to use Ruby and Mechanize to parse data on Foursquare's website. Here is my code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]
form.F12778070592981DXGWJ = ARGV[0]
form.F1277807059296KSFTWQ = ARGV[1]
page = form.submit form.buttons.first
puts page.body
But when I run this code, the following error popped up:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/mechanize-2.0.1/lib/mechanize/form.rb:162:in
`method_missing': undefined method `F12778070592981DXGWJ='
for #<Mechanize::Form:0x2b31f70> (NoMethodError)
from four.rb:10:in `<main>'
I checked and found that these two form field names, "F12778070592981DXGWJ" and "F1277807059296KSFTWQ", change every time I open Foursquare's page.
Has anyone had this problem before, where the field names change on every request? How should I solve it?
Our project is about parsing data on Foursquare, so I need to be able to log in first.
Mechanize is useful for sites which don't expose an API, but Foursquare has an established REST API already. I'd recommend using one of the Ruby libraries, perhaps foursquare2. These libraries abstract away things like authentication, so you just have to register your app and use the provided keys.
Instead of indexing the form fields by name, index them by position. That way you don't have to worry about names that change on each request:
form.fields[0].value = ARGV[0]
form.fields[1].value = ARGV[1]
...
However, like dwhalen said, using the REST API is probably a much better way. That's why it's there.
