I'm having trouble with the Mechanize gem: how do I convert a Mechanize::File into a Mechanize::Page?
Here's my piece of code:
link = page.link_with(:href => %r{/en/users}).click
When the users link is clicked, it goes to the page with the list of users. Now I want to click the first user, but I can't, because link is a Mechanize::File object.
Any help or suggestions would be great, thanks.
Mechanize uses Content-Type to determine how the resource should be handled. Occasionally websites will not set the mime-types for their resources. Mechanize::File is the default for unset Content-Type.
If you are only dealing with 'text/html', you can follow Jimm Stout's suggestion of using post_connect_hooks:
agent = Mechanize.new do |a|
  a.post_connect_hooks << ->(_,_,response,_) do
    # content_type is nil when the header is missing, so normalize with to_s
    if response.content_type.to_s.empty?
      response.content_type = 'text/html'
    end
  end
end
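One caveat worth checking: the response passed to the hook is a Net::HTTP response object, and Net::HTTPResponse#content_type returns nil rather than an empty string when the header is absent. The sketch below reproduces the hook's logic on a hand-built response using only the standard library. The default_content_type name is made up for illustration, and the bare Net::HTTPOK object is only a stand-in for what Mechanize would pass.

```ruby
require 'net/http'

# Hypothetical hook with the same four-argument shape as the one above:
# (agent, uri, response, body). Only the response is used.
default_content_type = lambda do |_agent, _uri, response, _body|
  # to_s covers both a missing header (nil) and an empty one ("")
  response.content_type = 'text/html' if response.content_type.to_s.empty?
end

# Stand-in response with no Content-Type header set.
response = Net::HTTPOK.new('1.1', '200', 'OK')
before = response.content_type

default_content_type.call(nil, nil, response, '')
after = response.content_type

puts "before: #{before.inspect}, after: #{after.inspect}"
```

With a real agent, the same lambda could be appended with a.post_connect_hooks << default_content_type.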
Just parse the body with nokogiri:
link = page.link_with(:href => %r{/en/users}).click
doc = Nokogiri::HTML link.body
agent.get doc.at('a')[:href]
I want to get taobao's list of URL of products on search result page without taobao API.
I tried following Ruby script.
require "open-uri"
require "rubygems"
require "nokogiri"
url='https://world.taobao.com/search/search.htm?_ksTS=1517338530524_300&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8&cna=htqfEgp0pnwCATyQWEDB%2FRCE&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'
charset = nil
html = open(url) do |f|
charset = f.charset
f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)
p doc.xpath('//*[@id="list-itemList"]/div/div/ul/li[1]/div/div[1]/div/a/@href').each{|i| puts i.text}
# => 0
I want to get list of URL like https://click.simba.taobao.com/cc_im?p=%D6%C7%C4%DC%CA%D6%B1%ED&s=328917633&k=525&e=lDs3%2BStGrhmNjUyxd8vQgTvfT37ERKUkJtUYVk0Fu%2FVZc0vyfhbmm9J7EYm6FR5sh%2BLS%2FyzVVWDh7%2FfsE6tfNMMXhI%2B0UDC%2FWUl0TVvvELm1aVClOoSyIIt8ABsLj0Cfp5je%2FwbwaEz8tmCoZFXvwyPz%2F%2ByQnqo1aHsxssXTFVCsSHkx4WMF4kAJ56h9nOp2im5c3WXYS4sLWfJKNVUNrw%2BpEPOoEyjgc%2Fum8LOuDJdaryOqOtghPVQXDFcIJ70E1c5A%2F3bFCO7mlhhsIlyS%2F6JgcI%2BCdFFR%2BwwAwPq4J5149i5fG90xFC36H%2B6u9EBPvn2ws%2F3%2BHHXRqztKxB9a0FyA0nyd%2BlQX%2FeDu0eNS7syyliXsttpfoRv3qrkLwaIIuERgjVDODL9nFyPftrSrn0UKrE5HoJxUtEjsZNeQxqovgnMsw6Jeaosp7zbesM2QBfpp6NMvKM5e5s1buUV%2F1AkICwRxH7wrUN4%2BFn%2FJ0%2FIDJa4fQd4KNO7J5gQRFseQ9Z1SEPDHzgw%3D however I am getting 0
What should I do?
I don't know taobao.com, but the page seems to run a lot of JavaScript, so the content may not be retrievable by a client without JavaScript support. Instead of open-uri, you could try the selenium-webdriver gem:
https://rubygems.org/gems/selenium-webdriver/versions/2.53.4
I am trying to scrape a website and store data in XML using Mechanize and Nokogiri. I didn't set up a Rails project and I am only using Ruby and IRB.
I wrote this method:
def mechanize_club
agent = Mechanize.new
agent.get("http://www.rechercheclub.applipub-fft.fr/rechercheclub/")
form = agent.page.forms.first
form.field_with(:name => 'codeLigue').options[0].select
form.submit
page2 = agent.get('http://www.rechercheclub.applipub-fft.fr/rechercheclub/club.do?codeClub=01670001&millesime=2015')
body = page2.body
html_body = Nokogiri::HTML(body)
codeclub = html_body.search('.form').children("tr:first").children("th:first").to_i
@codeclubs << codeclub
filepath = '/davidgeismar/Documents/codeclubs.xml'
builder = Nokogiri::XML::Builder.new(encoding: 'UTF-8') do |xml|
xml.root {
xml.codeclubs {
@codeclubss.each do |c|
xml.codeclub {
xml.code_ c.code
}
end
}
}
end
puts builder.to_xml
end
My first problem is that I don't know how to test my code.
I call ruby webscraper.rb in my console, the file is treated I think, but it doesn't create an XML file in the specified path.
Then, more specifically I am quite sure this code is wrong as I didn't get a chance to test it.
Basically what I am trying to do is to submit a form several times:
agent = Mechanize.new
agent.get("http://www.rechercheclub.applipub-fft.fr/rechercheclub/")
form = agent.page.forms.first
form.field_with(:name => 'codeLigue').options[0].select
form.submit
I think this code is OK, but I don't want it to select only options[0]. I want it to select an option, scrape all the data I need, go back to the page, then select options[1], and so on until there are no more options (an iteration, I guess).
the file is treated I think, but it doesnt create an xml file in the specified path.
There is nothing in your code that creates a file. You print some output, but don't do anything to open or write a file.
Perhaps you should read the IO and File documentation and review how you are using your filepath variable?
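As a minimal sketch of the missing step: builder.to_xml returns a String, so it can be handed to File.write. The example below uses a hard-coded XML string in place of the builder output and a temporary directory in place of the real path. The save_xml helper name is an assumption for illustration, not part of Nokogiri or Mechanize.

```ruby
require 'tmpdir'

# Hypothetical helper: writes an XML string to disk and returns the path.
def save_xml(path, xml)
  File.write(path, xml)
  path
end

# Stand-in for builder.to_xml in the question's method.
xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <root>
    <codeclubs>
      <codeclub><code>01670001</code></codeclub>
    </codeclubs>
  </root>
XML

written = nil
Dir.mktmpdir do |dir|
  path = save_xml(File.join(dir, 'codeclubs.xml'), xml)
  written = File.read(path)
end

puts written
```

In the question's method this would amount to File.write(filepath, builder.to_xml) after the builder block, provided filepath points into a directory that exists.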
The second problem is that you don't call your method anywhere. Though it's defined and Ruby will see it and parse the method, it has no idea what you want to do with it unless you invoke the method:
def mechanize_club
...
end
mechanize_club()
Is there a way of accessing this dialog box to get the file name, or to save this file somewhere so I can access it later? I am using Ruby Mechanize to navigate through the website to get to this screen.
There is no dialog with mechanize. You submit the form, that returns a Mechanize::File object, and you can then save that like so:
file = form.submit
File.open('myfile','wb'){|f| f << file.body}
I would do it this way.
Use Nokogiri to open the page:
@doc = Nokogiri::HTML(open(url))
Go through the document and find the download link.
Then you can use something like this:
require 'net/http'
Net::HTTP.start('theserver.com') { |http|
resp = http.get('/xx/the_file_to_downlaod.csv')
open('the_downlaod.csv', 'wb') { |file|
file.write(resp.body)
}
}
Is there a straightforward way to set custom headers with Mechanize 2.3?
I tried a former solution but get:
$agent = Mechanize.new
$agent.pre_connect_hooks << lambda { |p|
p[:request]['Referer'] = 'https://wwws.mysite.com/cgi-bin/apps/Main'
}
# ./mech.rb:30:in `<main>': undefined method `pre_connect_hooks' for nil:NilClass (NoMethodError)
The docs say:
get(uri, parameters = [], referer = nil, headers = {}) { |page| ... }
so for example:
agent.get 'http://www.google.com/', [], agent.page.uri, {'foo' => 'bar'}
alternatively you might like:
agent.request_headers = {'foo' => 'bar'}
agent.get url
You misunderstood the code you were copying. There was a newline in the example, but it disappeared in the formatting as it wasn't tagged as code. $agent contains nil since you're trying to use it before it has been initialized. You must initialize the object and then use it. Just try this:
$agent = Mechanize.new
$agent.pre_connect_hooks << lambda { |p| p[:request]['Referer'] = 'https://wwws.mysite.com/cgi-bin/apps/Main' }
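Note that the one-argument lambda follows the older hook style. As far as I can tell, Mechanize 2.x calls pre-connect hooks with two arguments, the agent and the Net::HTTP request, so a hook for 2.3 would more likely look like the sketch below. It runs against a plain Net::HTTP::Get object from the standard library instead of a live agent, and the set_referer name is made up for illustration.

```ruby
require 'net/http'

REFERER = 'https://wwws.mysite.com/cgi-bin/apps/Main'

# Assumed 2.x hook shape: (agent, request). The agent is ignored here.
set_referer = lambda do |_agent, request|
  request['Referer'] = REFERER
end

# Stand-in for the request object that would be passed to the hook.
request = Net::HTTP::Get.new('/')
set_referer.call(nil, request)

puts request['Referer']
```

With a real agent, the same lambda would be appended with $agent.pre_connect_hooks << set_referer.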
For this question I noticed people seem to use:
page = agent.get("http://www.you.com/index_login/", :referer => "http://www.you.com/")
As an aside, now that I have tested this answer, it turns out this was not the cause of my actual problem: every visit to the site I'm scraping requires going through the login pages again, even seconds after the first logged-in visit, even though I always load and save the complete cookie jar in YAML format. But that would lead to another question, of course.
I found some solutions using post_connect_hook and pre_connect_hook, but it seems like they don't work. I'm using the latest Mechanize version (2.1). There are no [:response] fields in the new version, and I don't know where to get them in the new version.
https://gist.github.com/search?q=pre_connect_hooks
https://gist.github.com/search?q=post_connect_hooks
Is it possible to make Mechanize return a UTF8 encoded version, instead of having to convert it manually using iconv?
Since Mechanize 2.0, the arguments passed to pre_connect_hooks() and post_connect_hooks() have changed.
See the Mechanize documentation:
pre_connect_hooks()
A list of hooks to call before retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
post_connect_hooks()
A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
Now you can't change the internal response-body value, because the argument is not an array. So the next best way is to replace the internal parser with your own:
class MyParser
def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
# insert your conversion code here. For example:
# thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/,"utf-8") # you need to rewrite content charset if it exists.
Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
end
end
agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
...
I found a solution that works pretty well:
class HtmlParser
def self.parse(body, url, encoding)
body.encode!('UTF-8', encoding, invalid: :replace, undef: :replace, replace: '')
Nokogiri::HTML::Document.parse(body, url, 'UTF-8')
end
end
Mechanize.new.tap do |web|
web.html_parser = HtmlParser
end
I haven't found any issues with it yet.
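The conversion step itself can be tried without Mechanize or Nokogiri. Here is a minimal sketch using only String#encode, assuming the source encoding is known (Windows-1252 in this example). The to_utf8 helper name is hypothetical. Unlike the encode! call in the parser above, it returns a new string and leaves the input untouched.

```ruby
# Hypothetical helper: transcode raw bytes from a known source encoding to
# UTF-8, dropping bytes that are invalid or have no UTF-8 equivalent.
def to_utf8(body, source_encoding)
  body.encode('UTF-8', source_encoding,
              invalid: :replace, undef: :replace, replace: '')
end

# "café" as Windows-1252 bytes: the final byte 0xE9 is "é".
raw = "caf\xE9".b

converted = to_utf8(raw, 'Windows-1252')

puts converted
puts converted.encoding
```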
In your script, just enter: page.encoding = 'utf-8'
However, depending on your scenario, you may alternatively need to enter the reverse (the encoding of the website Mechanize is working with) instead. For that, open Firefox, open the website you want Mechanize to work with, select Tools in the menubar, and then open Page Info. Determine what the page is encoded in from there.
Using that info, you would instead enter what the page is encoded in (such as page.encoding = 'windows-1252').
How about something like this:
class Mechanize
alias_method :original_get, :get
def get *args
doc = original_get *args
doc.encoding = 'utf-8'
doc
end
end
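A variation on the same idea, sketched with Module#prepend instead of alias_method so that the original method stays reachable through super. FakeAgent and FakePage are hypothetical stand-ins so the example runs without Mechanize; with the real gem the module would be prepended to Mechanize instead. The respond_to? guard is an addition of mine, on the assumption that not every object returned by get (a Mechanize::File, for instance) accepts an encoding.

```ruby
# Hypothetical stand-in for a page object.
FakePage = Struct.new(:encoding)

# Hypothetical stand-in for the agent class.
class FakeAgent
  def get(*_args)
    FakePage.new('windows-1252')
  end
end

# Wraps get and forces the encoding on whatever it returns.
module ForceUtf8
  def get(*args, &block)
    page = super # bare super forwards the same arguments and block
    page.encoding = 'utf-8' if page.respond_to?(:encoding=)
    page
  end
end

FakeAgent.prepend(ForceUtf8)

page = FakeAgent.new.get('http://somewhere.com/')
puts page.encoding
```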