Using Ruby Mechanize to download a file served as an attachment - ruby

I need to grab reports from a particular website. The method below does everything I need; the only catch is that the report, "report.csv", is served back with "content-disposition:filename=report.csv" in the response header when the page is posted (the page posts to itself).
require 'date'

def download_report
  # Follow the "Reporting" link, then choose the report type and a custom date range.
  page = @mechanize.click(@mechanize.current_page.link_with(:text => /Reporting/))
  page.form.field_with(:name => "rep").option_with(:value => "adperf").click
  page.form_with(:name => "get-report").field_with(:id => "sasReportingQuery.dateRange").option_with(:value => "Custom").click
  start_date = DateTime.parse(@start_date)
  end_date = DateTime.parse(@end_date)
  page.form_with(:name => "get-report").field_with(:name => "sd_display").value = start_date.strftime("%m/%d/%Y")
  page.form_with(:name => "get-report").field_with(:name => "ed_display").value = end_date.strftime("%m/%d/%Y")
  # Submitting the form is what triggers the report.csv response.
  page.form_with(:name => "get-report").submit
end
As far as I can tell, Mechanize is not capturing the file anywhere that I can get to it. Is there a way to get Mechanize to capture and download this file?
@mechanize.current_page does not contain the file, and @mechanize.history does not show that the file URL was presented to Mechanize.

The server appears to be telling the browser to save the document. "Content-disposition:filename" is the clue to that. Mechanize won't know what to do with that, and will try to read and parse the content, which, if it's a CSV, will not work.
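To make the header's role concrete, here is a minimal sketch that pulls the suggested filename out of a Content-Disposition value. The helper name `attachment_filename` and its regular expression are illustrative assumptions, not part of Mechanize, and it only handles the plain `filename=` form (quoted or unquoted).

```ruby
# Hypothetical helper: extract the suggested filename from a
# Content-Disposition header value. Returns nil when none is present.
def attachment_filename(content_disposition)
  return nil if content_disposition.nil?

  # Match filename="quoted name" or filename=bare-token (case-insensitive).
  match = content_disposition.match(/filename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i)
  match && (match[1] || match[2])
end

attachment_filename('filename=report.csv')               # the header from the question
attachment_filename('attachment; filename="report.csv"') # the more common quoted form
```

A response that carries such a header is meant to be saved under that name rather than rendered as a page.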
Without seeing the HTML page you're working with it's impossible to know exactly what mechanism they're using to trigger the download. Clicking an element could fire a JavaScript event, which Mechanize won't handle. Or, it could send a form to the server, which responds with the document download. In either case, you have to figure out what is being sent, why, and what specifically defines the document you want, then use that information to request the document.
Mechanize isn't the right tool to download an attachment. Use Mechanize to navigate forms, then use Mechanize's embedded Nokogiri to extract the URL for the document.
Then use something like curb or Ruby's built-in OpenURI to retrieve the attachment, or see "Using WWW:Mechanize to download a file to disk without loading it all in memory first" for more information.

Check the class of the returned page with page.class. If it is Mechanize::File rather than Mechanize::Page, you can just save it.
...
page = page.form_with(:name => "get-report").submit
page.class # Mechanize::File?
page.save('path/to/file')

Related

How to get the current URL for a HTML page

I am scraping a website using Nokogiri. This particular website deals with absolute URLs differently.
If I give it a URL like:
page = Nokogiri::HTML(open(link, :allow_redirections => :all))
it will redirect to the HTTPS version, and also redirect to the long version of the URL. For example, a link like
http://www.website.com/name
turns into
http://www.website.com/other-area/name
This is fine and doesn't really affect my scraper, however, there are certain edge-cases where, if I can tell my scraper what the current URL is, I can avoid them.
After I pass in the above link to my page variable, how can I get the current URL of that page after the redirect happens?
I'm assuming you're using the open_uri_redirections gem because :allow_redirections is not necessary in Ruby 2.4+.
Save the result of OpenURI's open (URI.open in current Ruby; the bare Kernel#open stopped accepting URLs in Ruby 3.0):
require 'open-uri'
require 'nokogiri'
r = URI.open('http://www.google.com/gmail')
r.base_uri
# #<URI::HTTPS https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1#>
page = Nokogiri::HTML(r)
Use Mechanize, then you can do:
agent = Mechanize.new
page = agent.get url
puts page.uri # this will be the redirected url
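The OpenURI approach can be tried without touching a real site. The sketch below is only an illustration: it starts a throwaway local server (an assumption made so the example is self-contained) that redirects /name to /other-area/name, then reads the final URL from base_uri.

```ruby
require 'open-uri'
require 'socket'

# Throwaway local server (an assumption for the demo): /name redirects
# to /other-area/name, every other path returns a tiny HTML page.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]

Thread.new do
  loop do
    client = server.accept
    path = client.gets.to_s.split[1]
    # Skip the remaining request headers up to the blank line.
    while (line = client.gets) && line != "\r\n"; end

    if path == '/name'
      client.write("HTTP/1.1 302 Found\r\n" \
                   "Location: http://127.0.0.1:#{port}/other-area/name\r\n" \
                   "Content-Length: 0\r\nConnection: close\r\n\r\n")
    else
      body = '<html><body>ok</body></html>'
      client.write("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n" \
                   "Content-Length: #{body.bytesize}\r\nConnection: close\r\n\r\n#{body}")
    end
    client.close
  end
end

requested = "http://127.0.0.1:#{port}/name"
response = URI.open(requested)      # follows the redirect
final_url = response.base_uri.to_s  # URL of the page that was actually served
puts "requested: #{requested}"
puts "final:     #{final_url}"
```

Comparing `requested` with `final_url` is one way a scraper can detect that it was redirected and handle the edge cases accordingly.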

How to use open-uri or paperclip to download images into database and feed them to a Rest API

I am working on a data integration app which needs to fetch images from one API (the image URLs come in XML) and post the images to a Rails-built REST API.
I tried Paperclip to download all the images, but I don't know how to handle the Paperclip::Attachment type when trying to post the images with HTTMultiParty.
I am thinking about using open-uri instead of Paperclip, which would store the file as binary data. Can anyone give me an example of that? And is there any good option for posting images to an API apart from HTTMultiParty?
I'll answer this question myself, since the solution can vary.
Fetching images and feeding them through an API can be done with HTTParty (download and upload text) + Paperclip (download an image by URL) + HTTMultiParty (upload the image). Here are some code examples I use in my application.
To me, HTTParty is the easiest way to deal with an API; the code can be as simple as this:
response = HTTParty.get('url')
response = HTTParty.post('url',
  :headers => { 'Header-Name' => 'header value' },  # HTTParty expects :headers to be a Hash
  :body => { 'data' => 'data content' })
A code example for Paperclip is here: answer on Stack Overflow
The important part is reading the Paperclip image as binary data. The code is:
Paperclip.io_adapters.for(productData[0].image).read
The last example is HTTMultiParty. When you pass a query with an instance of a File as a value for a PUT or POST request, the wrapper uses a bit of magic and multipart-post to execute a multipart upload. Apart from that, it is pretty much the same as HTTParty:
class ImgClient
  include HTTMultiParty
  base_uri 'http://localhost:3000'
end
respond = ImgClient.post('url',
  :headers => head,
  :query => {
    :image => Paperclip.io_adapters.for(product.image)
  })
Hope this will be helpful for other api newbies.

How do I search then parse results on a webpage with Ruby?

How would you use Ruby to open a website, enter a query in the search field, and then parse the results? For example, entering something into a search engine and then parsing the results page. I know how to use Nokogiri to open and parse a web page, but I am lost on how to type into the search field and move on to the results. Also, on the page I am actually searching, I have to click an Enter button; I can't simply press the Enter key to move forward. Thank you so much for your help.
Use Mechanize - a library used for automating interaction with websites.
Something like Mechanize will work, but interacting with the front-end UI code is always going to be slower and more problematic than making requests directly against the back end.
Your best bet would be to look at the request that is being made to the server (probably an HTTP GET or POST request with some associated params). You can do this with Firebug, or with Fiddler 2 on Windows. Then, once you know the parameters that the server will accept, just make the request yourself.
For example, if you were doing this with the duckduckgo.com search engine, you could either get Mechanize to go to duckduckgo.com, input text into the search box, and click submit, or you could just create a GET request to http://www.duckduckgo.com?q=search_term_here.
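As a rough sketch of the second option, the search URL can be built with Ruby's standard library so the term is escaped properly. The helper name `search_url` is made up for illustration; the `q` parameter is the one from the DuckDuckGo example.

```ruby
require 'uri'

# Build the search URL directly instead of driving the search form.
def search_url(term)
  uri = URI('http://www.duckduckgo.com/')
  uri.query = URI.encode_www_form(q: term) # escapes spaces and reserved characters
  uri.to_s
end

puts search_url('ruby mechanize')
# prints http://www.duckduckgo.com/?q=ruby+mechanize
```

The resulting URL can then be fetched with whichever HTTP client you prefer and the response body handed to Nokogiri for parsing.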
You can use Mechanize for something like this but it might be overkill. I would take a look at RestClient, especially if you don't need to manage cookies.
Edit:
If you can determine the specific URL that the form submits to, say 'example.com/search', and you know the request is a POST (which it usually is when you are submitting a form), you could construct something like this with Mechanize:
agent = Mechanize.new
agent.post 'http://example.com/search', {
  "_id0:Number" => string_to_search_for,
  "_id0:submitButton" => "Enter"
}
Notice how the 'name' attribute of a form element becomes a key in the POST data and its 'value' attribute becomes the value. A text 'input' element gets its value directly from the text you would have entered. This gets transformed into a request and submitted to the server when you push the submit button (in this case, of course, you are making the request directly). The result of the POST should be some HTML that you can parse for the info you need.
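To see what that name/value mapping looks like on the wire, here is a small sketch using Net::HTTP from the standard library instead of Mechanize. The endpoint and field names mirror the example above, and the search value '12345' is a made-up placeholder. The request is only built, not sent.

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint, mirroring the example above.
uri = URI('http://example.com/search')
request = Net::HTTP::Post.new(uri)

# Each key is a form element's name attribute; each value is what that
# element would have submitted.
request.set_form_data(
  '_id0:Number'       => '12345',
  '_id0:submitButton' => 'Enter'
)

puts request['Content-Type'] # application/x-www-form-urlencoded
puts request.body            # the URL-encoded form fields

# To actually send it (not executed here):
# response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
```

Comparing this body with what the browser's network inspector shows for the real form is a quick way to confirm you have the right field names.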

Redirect from current page to a new page

I am having trouble with some Ruby CGI.
I have a home page (index.cgi) which is a mix of HTML and Ruby, and has a login form in it.
On clicking on the Submit button the POST's action is the same page (index.cgi), at which point I check to make sure the user has entered data into the correct fields.
I have a counter which increases by 1 each time a field is left empty. If this counter is 0 I want to change the current loaded page to something like contents.html.
With this I have:
if ( errorCount > 0 )
  # do nothing
else
  ....
end
What do I need to put where I have the ....?
Unfortunately I cannot use any frameworks as this is for University coursework, so have to use base Ruby.
As for using the CGI#header method as you have suggested, I have tried using this however it is not working for me.
As mentioned my page is index.cgi. This is made of a mixture of Ruby and HTML using "here doc" statements.
At the top of my code page I have my shebang line, followed by an HTML header statement.
I then do the CGI form validation part, and within this I have tried doing something like:
print this.cgi( { 'Status' => '302 Moved', 'location' => '{http://localhost:10000/contents.html' } )
All that happens is that this line is printed at the top of the browser window, above my index.cgi page.
I hope this makes sense.
To redirect the browser to another URL, you must output a 30x HTTP response that contains a Location header (for example, Location: /foo/bar). You can do that using the CGI#header method.
Instead of dealing with these details that you have not yet mastered, I suggest you use a simple framework such as Sinatra or, at least, write your script as a Rack-compatible application.
If you really need to use the bare CGI class, have a look at this simple example: https://github.com/tdtds/amazon-auth-proxy/blob/master/amazon-auth-proxy.cgi.
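For the situation described in the question, the likely culprit is ordering: a CGI script's headers end at the first blank line it prints, so a redirect printed after the HTML header has already gone out is treated as page content. Below is a minimal sketch that builds the raw response by hand, without the CGI class. The function name and the error-page body are hypothetical; the URL is the one from the question.

```ruby
# Build the complete CGI response for the login check.
def login_response(error_count)
  if error_count.zero?
    # Redirect: headers only, terminated by a blank line, no HTML body.
    "Status: 302 Found\r\n" \
      "Location: http://localhost:10000/contents.html\r\n\r\n"
  else
    # Validation failed: serve the form page again.
    "Content-Type: text/html\r\n\r\n" \
      "<html><body>Please fill in every field.</body></html>"
  end
end

# In index.cgi this must be the first thing written to standard output:
# print login_response(error_count)
```

The point of returning one string per branch is that the script decides between redirecting and rendering before it prints anything at all.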

Manual POST request

Scenario: I have logged into a website, gained cookies etc, got to a particular webpage with a form + hidden fields. I now want to be able to create my own http post with my own hidden form data instead of what is on the webpage and verify the response instead of using the one on the webpage.
Reason: Testing against pre-existing data (I know, I know) which could be different on each environment hence no predictable way to use it. We need a workaround.
Is there any way to do this without manually editing the existing form and submitting that? Feels a little 'hacky'.
Ideally, I would like to say something like:
browser.post 'url', 'field1=test&field2=abc'
I would probably switch to Mechanize to muck around at the protocol level. Something like this, added to your script:
# Older releases used the WWW::Mechanize namespace; current ones use Mechanize.
b = Mechanize.new
b.get('http://yoursite.com/current_page') do |page|
  # Submit the login form
  my_form = page.form_with(:action => '/post/url') do |f|
    f.form_loginname = 'tim'
    f.form_pw = 'password'
  end.click_button
end
