Ruby + Watir: format cookie for web crawler

I have started building a web scraper to get some data for an API; however, the URL requires a 'continue as guest' step for each daily visit. This is the cookie I got from the request header:
web_site=eyJpdiI6Im0ySWpxMFdLVW56RVk0NzU5cEZNZlE9PSIsInZhbHVlIjoiWU5XVWNzZTlMejEzOHFhQ0hsSkVlbmZxbm5zeWNuU0tGZ1hNOUNuUGZQdFFJUDB2M2M4NFRzbGx2WjdMZ1NTQ0dUZ0ZqRmNUVmk5XC9JeThVTlhVZm1RPT0iLCJtYWMiOiI4ZTBhZjUxZWFmNDhiZjI0MzE1OGMyNjY2N2EyOTkyODM4N2NlYWIzZTBmNmI2ZmExMDg5ZTcyYTY3MzM2MWZiIn0%3D;
expires=Tue, 01-Mar-2022 20:00:34 GMT; Max-Age=7200;
path=/;samesite=none;Secure; domain=.website.co.uk
So I am trying to add this cookie to my request, but I can't seem to get the formatting right (or I am doing something else wrong).
This:
browser.cookies.add('openplay_session', 'eyJpdiI6Ill6RFRPeXJzcktzcndCM3pEM1I0S3c9PSIsInZhbHVlIjoiN1pCRDZjaXRBbUpSaU1BOFhEVkVld0tsWXlcL0l6RlNISmdoQ25JWnpcL3BxNVREWWFaWk9kY1wvTDVFMUI0aW81TmptN0ppTFlISkxMclhianB1WlFRWnc9PSIsIm1hYyI6IjE1NTVjZjJjZTQzYzQ0ODdmZTdlODY2ZDBjODdjNzYyY2Q5ZGJmMDYxNjIyOGUzZjU3N2JjZWYwMjRjZjVjNzUifQ%3D%3D', path='/', Secure = true, domain='.openplay.co.uk')
returns:
`add': wrong number of arguments (given 5, expected 2..3) (ArgumentError)
Full code so far (not complete, as I can't access the table's content without the cookie):
require 'watir'
require 'webdrivers'
require 'nokogiri'
require 'securerandom'
browser = Watir::Browser.new
browser.goto 'https://www.website.co.uk/booking'
browser.cookies.add('open_session', 'eyJpdiI6Ill6RFRPeXJzcktzcndCM3pEM1I0S3c9PSIsInZhbHVlIjoiN1pCRDZjaXRBbUpSaU1BOFhEVkVld0tsWXlcL0l6RlNISmdoQ25JWnpcL3BxNVREWWFaWk9kY1wvTDVFMUI0aW81TmptN0ppTFlISkxMclhianB1WlFRWnc9PSIsIm1hYyI6IjE1NTVjZjJjZTQzYzQ0ODdmZTdlODY2ZDBjODdjNzYyY2Q5ZGJmMDYxNjIyOGUzZjU3N2JjZWYwMjRjZjVjNzUifQ%3D%3D', path='/', domain='.website.co.uk')
parsed_page = Nokogiri::HTML(browser.html)
File.open("parsed.txt", "w") { |f| f.write "#{parsed_page}" }
puts parsed_page.title
links = parsed_page.css('a')
links.map {|e| e["href"]}
puts links
browser.close

Those last 3 parameters need to be inside a Hash.
browser.cookies.add('openplay_session', value, {path: '/', secure: true, domain: '.openplay.co.uk'})
https://github.com/watir/watir/blob/main/lib/watir/cookies.rb#L40-L55
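Putting that together with the script above, a minimal sketch (cookie value truncated here; depending on the site you may also need a refresh so the page is reloaded with the session cookie in place):
require 'watir'
require 'webdrivers'
require 'nokogiri'

browser = Watir::Browser.new
browser.goto 'https://www.website.co.uk/booking'

# All cookie options go in a single Hash as the third argument
browser.cookies.add('open_session', 'eyJpdiI6...', { path: '/', secure: true, domain: '.website.co.uk' })
browser.refresh # assumption: reload so the server sees the cookie

parsed_page = Nokogiri::HTML(browser.html)
puts parsed_page.title
browser.close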

Related

Want to get Taobao's list of product URLs on a search result page without the Taobao API

I want to get Taobao's list of product URLs on the search result page without using the Taobao API.
I tried the following Ruby script:
require "open-uri"
require "rubygems"
require "nokogiri"
url='https://world.taobao.com/search/search.htm?_ksTS=1517338530524_300&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8&cna=htqfEgp0pnwCATyQWEDB%2FRCE&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'
charset = nil
html = open(url) do |f|
  charset = f.charset
  f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)
p doc.xpath('//*[@id="list-itemList"]/div/div/ul/li[1]/div/div[1]/div/a/@href').each { |i| puts i.text }
# => 0
I want to get list of URL like https://click.simba.taobao.com/cc_im?p=%D6%C7%C4%DC%CA%D6%B1%ED&s=328917633&k=525&e=lDs3%2BStGrhmNjUyxd8vQgTvfT37ERKUkJtUYVk0Fu%2FVZc0vyfhbmm9J7EYm6FR5sh%2BLS%2FyzVVWDh7%2FfsE6tfNMMXhI%2B0UDC%2FWUl0TVvvELm1aVClOoSyIIt8ABsLj0Cfp5je%2FwbwaEz8tmCoZFXvwyPz%2F%2ByQnqo1aHsxssXTFVCsSHkx4WMF4kAJ56h9nOp2im5c3WXYS4sLWfJKNVUNrw%2BpEPOoEyjgc%2Fum8LOuDJdaryOqOtghPVQXDFcIJ70E1c5A%2F3bFCO7mlhhsIlyS%2F6JgcI%2BCdFFR%2BwwAwPq4J5149i5fG90xFC36H%2B6u9EBPvn2ws%2F3%2BHHXRqztKxB9a0FyA0nyd%2BlQX%2FeDu0eNS7syyliXsttpfoRv3qrkLwaIIuERgjVDODL9nFyPftrSrn0UKrE5HoJxUtEjsZNeQxqovgnMsw6Jeaosp7zbesM2QBfpp6NMvKM5e5s1buUV%2F1AkICwRxH7wrUN4%2BFn%2FJ0%2FIDJa4fQd4KNO7J5gQRFseQ9Z1SEPDHzgw%3D however I am getting 0
What should I do?
I don't know taobao.com, but the page seems to be running a lot of JavaScript, so the content probably cannot be retrieved with a client that has no JavaScript capabilities. Instead of open-uri, you could try the selenium-webdriver gem:
https://rubygems.org/gems/selenium-webdriver/versions/2.53.4
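For what it's worth, a minimal sketch with selenium-webdriver (the CSS selector is a guess based on the XPath above and will likely need adjusting; a current selenium-webdriver is assumed rather than the 2.53 version linked):
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.navigate.to 'https://world.taobao.com/search/search.htm?q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8'

# Wait for the JavaScript-rendered result list to show up (selector is a guess)
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_elements(css: '#list-itemList a').any? }

# Collect the product link hrefs
links = driver.find_elements(css: '#list-itemList a').map { |a| a.attribute('href') }
puts links

driver.quit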

Ruby crawl site, add URL parameter

I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
  spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
I tried this, but it's not working; it errors out ("405 Method Not Allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
  spider.every_url do |url|
    link = url + "?foo=bar"
    response = open(link).read
  end
end
Instead of relying on Spidr, I just grabbed a CSV of the URLs I needed from Google Analytics, then ran through those. Got the job done.
require 'csv'
require 'open-uri'
CSV.foreach(File.path("the-links.csv")) do |row|
  link = "http://www.example.com" + row[0] + "?foo=bar"
  encoded_url = URI.encode(link)
  response = open(encoded_url).read
  puts encoded_url
  puts
end
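For anyone who does want to stay inside Spidr, here is a rough, untested sketch of appending the parameter while respecting any existing query string (Spidr yields URI objects, so the query can be edited directly; foo=bar is just the placeholder from the question):
require 'spidr'
require 'open-uri'

Spidr.site('http://www.example.com/') do |spider|
  spider.every_url do |url|
    # Spidr yields URI objects; append foo=bar to whatever query is already there
    with_param = url.dup
    with_param.query = [url.query, 'foo=bar'].compact.join('&')

    begin
      URI.open(with_param.to_s).read # URI.open needs Ruby >= 2.5; use open() on older Rubies
      puts with_param
    rescue OpenURI::HTTPError => e
      warn "#{with_param} -> #{e.message}" # e.g. the 405s mentioned above
    end
  end
end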

Testing filepicker.io security using Ruby

I'm trying to build a test that will allow me to exercise FilePicker.io security. The code is run as:
ruby test.rb [file handle]
and the result is the query string that I can append to a FilePicker URL. I'm pretty sure my policy is getting read properly, but my signature isn't. Can someone tell me what I'm doing wrong? Here's the code:
require 'rubygems'
require 'base64'
require 'cgi'
require 'openssl'
require 'json'
handle = ARGV[0]
expiry = Time::now.to_i + 3600
policy = {:handle=>handle, :expiry=>expiry, :call=>["pick","read", "stat"]}.to_json
puts policy
puts "\n"
secret = 'SECRET'
encoded_policy = CGI.escape(Base64.encode64(policy))
signature = OpenSSL::HMAC.hexdigest('sha256', secret, encoded_policy)
puts "?signature=#{signature}&policy=#{encoded_policy}"
The trick is to use Base64.urlsafe_encode64 instead of CGI.escape:
require 'rubygems'
require 'base64'
require 'cgi'
require 'openssl'
require 'json'
handle = ARGV[0]
expiry = Time::now.to_i + 3600
policy = {:handle=>handle, :expiry=>expiry}.to_json
puts policy
puts "\n"
secret = 'SECRET'
encoded_policy = Base64.urlsafe_encode64(policy)
signature = OpenSSL::HMAC.hexdigest('sha256', secret, encoded_policy)
puts "?signature=#{signature}&policy=#{encoded_policy}"
When tested with the sample values for expiry, handle, and secret in the Filepicker.io docs, it returns the same values as the Python example.
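Assuming the old Filepicker file-URL scheme (the base URL below is from memory, so treat it as an assumption), the generated query string would be appended roughly like this:
# Hypothetical usage of the generated query string; the base URL is an assumption
file_url   = "https://www.filepicker.io/api/file/#{handle}"
signed_url = "#{file_url}?signature=#{signature}&policy=#{encoded_policy}"
puts signed_url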
I resolved this in my Ruby 1.8 environment by removing the CGI.escape and gsubbing out the newline:
Base64.encode64(policy).gsub("\n","")
elevenarms's answer is the best for Ruby 1.9 users, but you have to do something a bit kludgy like the above for Ruby 1.8. I'll accept his answer nonetheless, since most of us are or shortly will be in 1.9 these days.

Why doesn't Nokogiri load the full page?

I'm using Nokogiri to open Wikipedia pages about various countries, and then extracting the names of these countries in other languages from the interwiki links (links to foreign-language wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, it doesn't contain the interwiki links that I need. How can I force it to download the whole page?
Here's my code:
url = "http://en.wikipedia.org/wiki/" + country_name
page = nil
begin
  page = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
  puts "No article found for " + country_name
end
language_part = page.css('div#p-lang')
Test:
with country_name = "France"
=> []
with country_name = "Thailand"
=> really long array that I don't want to quote here,
but containing all the right data
Maybe this issue goes beyond Nokogiri and into OpenURI - anyway I need to find a solution.
Nokogiri does not retrieve the page; it asks OpenURI to fetch it and then does an internal read on the StringIO object that OpenURI returns.
require 'open-uri'
require 'zlib'
stream = open('http://en.wikipedia.org/wiki/France')
if stream.content_encoding.empty?
  body = stream.read
else
  body = Zlib::GzipReader.new(stream).read
end
p body
Here's what you can key off of:
>> require 'open-uri' #=> true
>> open('http://en.wikipedia.org/wiki/France').content_encoding #=> ["gzip"]
>> open('http://en.wikipedia.org/wiki/Thailand').content_encoding #=> []
In this case, if content_encoding is [] (plain "text/html"), it just reads the stream; if it's ["gzip"], it decodes first.
Doing all the stuff above and tossing it to:
require 'nokogiri'
page = Nokogiri::HTML(body)
language_part = page.css('div#p-lang')
should get you back on track.
Do this after all the above to confirm visually you're getting something usable:
p language_part.text.gsub("\t", '')
See Casper's answer and comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.
After quite a bit of head scratching, the problem turned out to be here:
> wget -S 'http://en.wikipedia.org/wiki/France'
Resolving en.wikipedia.org... 91.198.174.232
Connecting to en.wikipedia.org|91.198.174.232|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Content-Language: en
Last-Modified: Fri, 01 Jul 2011 23:31:36 GMT
Content-Encoding: gzip <<<<------ BINGO!
...
You need to unpack the gzipped data, which open-uri does not do automatically.
Solution:
require 'net/http'
require 'stringio'
require 'zlib'

def http_get(uri)
  url = URI.parse(uri)
  res = Net::HTTP.start(url.host, url.port) { |h| h.get(url.path) }
  headers = res.to_hash
  gzipped = headers['content-encoding'] && headers['content-encoding'][0] == "gzip"
  gzipped ? Zlib::GzipReader.new(StringIO.new(res.body)).read : res.body
end
And then:
page = Nokogiri::HTML(http_get("http://en.wikipedia.org/wiki/France"))
require 'open-uri'
require 'zlib'

# Note: this snippet assumes it lives inside a helper method that receives `url` and an optional block
open(url, 'Accept-Encoding' => 'gzip, deflate') do |response|
  if response.content_encoding.include?('gzip')
    response = Zlib::GzipReader.new(response)
    response.define_singleton_method(:method_missing) do |name|
      to_io.public_send(name)
    end
  end

  yield response if block_given?
  response
end

How to manually add a cookie to Mechanize state?

I'm working in Ruby, but my question is valid for other languages as well.
I have a Mechanize-driven application. The server I'm talking to sets a cookie using JavaScript (rather than standard set-cookie), so Mechanize doesn't catch the cookie. I need to pass that cookie back on the next GET request.
The good news is that I already know the value of the cookie, but I don't know how to tell Mechanize to include it in my next GET request.
I figured it out by extrapolation (and reading sources):
agent = Mechanize.new
...
cookie = Mechanize::Cookie.new(key, value)
cookie.domain = ".oddity.com"
cookie.path = "/"
agent.cookie_jar.add(cookie)
...
page = agent.get("https://www.oddity.com/etc")
Seems to do the job just fine.
update
As #Benjamin Manns points out, Mechanize now wants a URL in the add method. Here's the amended recipe, making the assumption that you've done a GET using the agent, and that the last page visited is the domain for the cookie (saves a URI.parse()):
agent = Mechanize.new
...
cookie = Mechanize::Cookie.new(key, value)
cookie.domain = ".oddity.com"
cookie.path = "/"
agent.cookie_jar.add(agent.history.last.uri, cookie)
These answers are old, so to bring this up to date, these days it looks more like this:
cookie = Mechanize::Cookie.new :domain => '.mydomain.com', :name => name, :value => value, :path => '/', :expires => (Date.today + 1).to_s
agent.cookie_jar << cookie
I wanted to add my experience for specifically passing cookies from Selenium to Mechanize:
Get the cookies from your selenium driver
sel_driver = Selenium::WebDriver.for :firefox
sel_driver.navigate.to('https://sample.com/javascript_login')
#login
sel_cookies = sel_driver.manage.all_cookies
The :expires value from a Selenium cookie is a DateTime object or blank. However, the :expires value for a Mechanize cookie (a) must be a string and (b) cannot be blank:
sel_cookies.each do |c|
  if c[:expires].blank?
    c[:expires] = (DateTime.now + 10.years).to_s # arbitrary date in the future; .blank? and 10.years come from ActiveSupport
  else
    c[:expires] = c[:expires].to_s
  end
end
Now instantiate as Mechanize cookies and place them in the cookie jar
mech_agent = Mechanize.new
sel_cookies.each { |c| mech_agent.cookie_jar << Mechanize::Cookie.new(c) }
mech_agent.get 'https://sample.com/html_pages'
You can also try this:
Mechanize::Cookie.parse(url, "SessionCookie=#{sessid}", Logger.new(STDOUT)) { |c| agent.cookie_jar.add(url, c) }
source: http://twitter.com/#!/calebcrane/status/51683884341002240
response.to_hash.fetch("set-cookie").each do |c|
  agent.cookie_jar.parse c
end
response here is a native Ruby stdlib thing, like Net::HTTPOK.
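For context, a minimal sketch of where such a response could come from; note that with current Mechanize the jar is an HTTP::CookieJar, whose parse also wants the origin URI (the URL below is just a placeholder):
require 'net/http'
require 'mechanize'

agent = Mechanize.new
uri   = URI('https://sample.com/javascript_login') # placeholder URL

# A plain stdlib request; response is e.g. a Net::HTTPOK
response = Net::HTTP.get_response(uri)

response.to_hash.fetch('set-cookie', []).each do |c|
  agent.cookie_jar.parse(c, uri) # origin URI required by HTTP::CookieJar#parse
end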
