Ruby form auto submission Mechanize::ResponseCodeError

submit_form = agent.get("http://sample.com/NewTask.aspx").form("aspnetForm") do |f|
  f["ctl00$ContentPlaceHolder1$txtNumber"]  = "1234"
  f["ctl00$ContentPlaceHolder1$cmbText"]    = "test"
  f["ctl00$ContentPlaceHolder1$FUpload$fu"] = ""
  f["ctl00$ContentPlaceHolder1$btn"]        = "test"
  f.submit(f.button_with(:name => "ctl00$ContentPlaceHolder1$btnOK"))
end
This is the code I wrote to auto-submit a form using the Mechanize library for Ruby. It came back with the Mechanize::ResponseCodeError below. I really don't see any error in my code; could anyone kindly tell me whether this is a code error or something on the server side (say, the server prevents automated form submission)?
C:/Ruby193/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:291:in `fetch': 500 => Net::HTTPInternalServerError for http://sample.com/NewTask.aspx -- unhandled response (Mechanize::ResponseCodeError)
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize.rb:1207:in `post_form'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize.rb:515:in `submit'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/form.rb:178:in `submit'
from auto_post.rb:27:in `block in <main>'
from (eval):23:in `form_with'
from auto_post.rb:13:in `<main>'

You need to route the traffic through a debugging proxy like Fiddler or Charles:
agent.set_proxy 'localhost', 8888
then proxy your browser the same way and compare the two requests.
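If you want to see what Mechanize itself is sending while the proxy is attached, you can also enable its logger. A minimal sketch, assuming Mechanize 2.x and Fiddler's default port 8888:

require 'mechanize'
require 'logger'

agent = Mechanize.new
agent.log = Logger.new($stderr)    # log each request/response Mechanize performs
agent.set_proxy 'localhost', 8888  # Fiddler's default listening port

page = agent.get("http://sample.com/NewTask.aspx")
# Fill in and submit the form as before, then compare the captured
# request in Fiddler with the one your browser sends.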

Related

Proper way to upload a doc to FSCrawler for indexing in Elasticsearch

I'm prototyping a Rails application to upload documents to FSCrawler (running the REST interface), to incorporate into an Elasticsearch index. Using their example, this works:
response = `curl -F "file=@#{params[:document][:upload].tempfile.path}" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"`
The file gets uploaded, and the content gets indexed. This is an example of what I get:
"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...
When I run curl at the command line, I get EVERYTHING, like the "filename" being properly set. If I use it as above, in the Rails controller, as you can see, the filename is set to the Tempfile's filename. That's not a workable solution. Trying to use params[:document][:upload].tempfile (without .path) or just params[:document][:upload] both fail entirely.
I'm trying to do this "the right way," but every incarnation of using a proper HTTP client to do this fails. I can't figure out how to invoke an HTTP POST that will submit a file to FSCrawler the way curl (on the command line) does it.
In this example, I'm just trying to send the file by using the Tempfile object. For some reason, FSCrawler gives me the error in the comment, and I get a little metadata, but no content is indexed:
## Failed to extract [100000] characters of text for ...
## org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
uri = URI("http://127.0.0.1:8080/fscrawler/_upload?debug=true")
request = Net::HTTP::Post.new(uri)
form_data = [['file', params[:document][:upload].tempfile,
              { filename: params[:document][:upload].original_filename,
                content_type: params[:document][:upload].content_type }]]
request.set_form form_data, 'multipart/form-data'
response = Net::HTTP.start(uri.hostname, uri.port) do |http|
  http.request(request)
end
If I change the above to use params[:document][:upload].tempfile.path, then I don't get the error about the InputStream, but I also (still) do not get any content indexed. This is an example of what I get:
{"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}
If I try to use RestClient, and I try send the file by referencing the actual path to the Tempfile, then I get this error message, and I get nothing:
## Unsupported media type
response = RestClient.post 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
                           file: params[:document][:upload].tempfile.path,
                           content_type: params[:document][:upload].content_type
If I try to .read() the file, and submit that, then I break the FSCrawler form:
## Internal server error
request = RestClient::Request.new(
  :method => :post,
  :url => 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  :payload => {
    :multipart => true,
    :file => File.read(params[:document][:upload].tempfile),
    :content_type => params[:document][:upload].content_type
  })
response = request.execute
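For comparison, and this is an assumption about the failure rather than something the original post tested: RestClient builds a proper multipart file part (with a filename and part headers) when the payload contains a File or IO object, whereas File.read yields a plain String that is sent as an ordinary form field. A sketch of that variant:

require 'rest-client'

# Passing a File object (not its contents) lets RestClient generate the
# multipart part with a filename and content type, the way curl -F does.
response = RestClient.post(
  'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  file: File.new(params[:document][:upload].tempfile.path, 'rb')
)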
Obviously, I've been trying this every way I can, but I can't replicate whatever curl is doing with any known Ruby-based HTTP clients. I'm utterly lost as to how to get Ruby to submit data to FSCrawler in a way that will get the document contents indexed properly. I've been at this far longer than I care to admit. What am I missing here?
I finally tried Faraday, and, based on this answer, came up with the following:
connection = Faraday.new('http://127.0.0.1:8080') do |f|
  f.request :multipart
  f.request :url_encoded
  f.adapter :net_http
end
file = Faraday::UploadIO.new(
  params[:document][:upload].tempfile.path,
  params[:document][:upload].content_type,
  params[:document][:upload].original_filename
)
payload = { :file => file }
response = connection.post('/fscrawler/_upload', payload)
Using Fiddler helped me to see the results of my attempts, as I got closer and closer to the curl request. This snippet posts the request almost exactly as curl does. To route this call through the proxy, I just needed to add , proxy: 'http://localhost:8866' to the end of the connection setup.
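Spelled out, the connection setup with the proxy option would look like this (a sketch; 8866 is simply the port quoted above):

connection = Faraday.new('http://127.0.0.1:8080', proxy: 'http://localhost:8866') do |f|
  f.request :multipart    # encode File/UploadIO values as multipart/form-data
  f.request :url_encoded  # fall back to urlencoded for plain hashes
  f.adapter :net_http
end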

Ruby hipchat gem invalid send file

So this is related to an earlier post I made about this method. This is essentially what I am using to send files via HipChat:
#!/usr/bin/env ruby
require 'hipchat'
client = HipChat::Client.new('HIPCHAT_TOKEN', :api_version => 'v2', :server_url => 'HIPCHAT_URL')
client.user('some_username').send_file('message', File.open('./output/some-file.csv') )
client['some_hipchat_room'].send_file('some_user', 'message', File.open('./output/some-file.csv') )
Now for some reason the send_file method is invalid:
/path/to/gems/hipchat-1.5.4/lib/hipchat/errors.rb:40:in `response_code_to_exception_for': You requested an invalid method. path:https://hipchat.illum.io/v2/user/myuser#myemail/share/file?auth_token=asdfgibberishasdf method:Net::HTTP::Get (HipChat::MethodNotAllowed)
from /path/to/gems/gems/hipchat-1.5.4/lib/hipchat/user.rb:50:in `send_file'
I think this indicates that you should be using POST instead of GET, but I'm not sure because I haven't used this library or HipChat.
Looking at the question you referenced and the source posted by another user, they're sending the request using self.class.post, while your debug output shows Net::HTTP::Get.
To debug, could you try:
file = Tempfile.new('foo').tap do |f|
  f.write("the content")
  f.rewind
end
user = client.user(some_username)
user.send_file('some bytes', file)
The issue is that I was attempting to connect to the server via http instead of https. If the following client is causing issues:
client = HipChat::Client.new('HIPCHAT_TOKEN', :api_version => 'v2', :server_url => 'my.company.com')
Then try adding https:// to the beginning of the server URL:
client = HipChat::Client.new('HIPCHAT_TOKEN', :api_version => 'v2', :server_url => 'https://my.company.com')

Crawl data using ruby mechanize

I am crawling data from http://www.mca.gov.in/DCAPortalWeb/dca/MyMCALogin.do?method=setDefaultProperty&mode=53
Below is the code I have tried:
require 'mechanize'

uri = "http://www.mca.gov.in/DCAPortalWeb/dca/MyMCALogin.do?method=setDefaultProperty&mode=53"
# @html, html_content = @mobj.get_data(uri)

agent = Mechanize.new
html_page = agent.get uri
html_form = html_page.form
html_form.radiobuttons_with(:name => 'search', :value => '2')[0].check
html_form.submit
puts html_page.content
Error:
var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 500 => Net::HTTPInternalServerError for http://www.mca.gov.in/DCAPortalWeb/dca/ProsecutionDetailsSRAction.do -- unhandled response (Mechanize::ResponseCodeError)
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize.rb:1281:in `post_form'
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize.rb:548:in `submit'
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize/form.rb:223:in `submit'
from ministry_corp_aff.rb:32:in `start'
from ministry_corp_aff.rb:52:in `<main>'
If I manually click on the 3rd radio button and then submit, I get a .zip file. I was trying to fetch data from the .xls file inside that zip.
The radio button has an onclick event handler that triggers the execution of some JavaScript. In addition, clicking on the Submit <a> tag also causes some JavaScript to execute. That JavaScript probably sets some values that are returned with the form, which the server examines.
Mechanize cannot execute JavaScript. You need Selenium WebDriver for that.
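A minimal sketch with the selenium-webdriver gem. The selectors here are assumptions (I haven't inspected the page's markup), so adjust them to the real radio button and submit link:

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get "http://www.mca.gov.in/DCAPortalWeb/dca/MyMCALogin.do?method=setDefaultProperty&mode=53"

# Clicking in a real browser fires the radio button's onclick handler.
radios = driver.find_elements(:name, 'search')  # assumed name, as in the Mechanize code
radios[2].click

# The submit link's JavaScript runs normally here too.
driver.find_element(:link_text, 'Submit').click  # assumed link text

# Wait for the download to finish, then clean up.
sleep 10
driver.quit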

Good request from browser but bad request from ruby?

I'm using the Google Custom Search API and I'm trying to access it from some Ruby code. Here is a snippet of the code:
req = Typhoeus::Request.new("https://www.googleapis.com/customsearch/v1?key={my_key}&cx=017576662512468239146:omuauf_lfve&q=" + keyword, followlocation: true)
res = req.run
It appears that the body of the response is this:
<p>Your client has issued a malformed or illegal request. <ins>That’s all we know.</ins>
'
from /usr/local/lib/ruby/2.1.0/json/common.rb:155:in `parse'
from main.rb:20:in `initialize'
from main.rb:41:in `new'
from main.rb:41:in `<main>'
When I try to do the same thing from the browser, it works like a charm. Even more confusing is that this same code worked 12 hours ago; I only changed the keyword it should look for, yet it started returning the error.
Any suggestions? I'm sure that I have enough credits for more requests.
You probably have problems with special characters in your GET parameter keyword. If you enter the URL in your browser, the browser adjusts these. For Ruby, however, you need to escape these characters yourself, so that a string like "sky line" becomes "sky+line" and so on. There is a utility function, CGI::escape, which is used like this:
require 'cgi'
CGI::escape("sky line")
=> "sky+line"
Your fixed code would look something like this:
req = Typhoeus::Request.new("https://www.googleapis.com/customsearch/v1?key={my_key}&cx=017576662512468239146:omuauf_lfve&q=" + CGI::escape(keyword), followlocation: true)
res = req.run
However, since you're using Typhoeus anyway, you should be able to use its params parameter and let Typhoeus handle the escaping:
req = Typhoeus::Request.new(
  "https://www.googleapis.com/customsearch/v1?cx=017576662512468239146:omuauf_lfve",
  followlocation: true,
  params: { q: keyword, key: my_key }
)
res = req.run
There are more examples on Typhoeus' GitHub page.

`open_http': 403 Forbidden (OpenURI::HTTPError) for the string "Steve_Jobs" but not for any other string

I was going through the Ruby tutorials provided at http://ruby.bastardsbook.com/ and I encountered the following code:
require "open-uri"
remote_base_url = "http://en.wikipedia.org/wiki"
r1 = "Steve_Wozniak"
r2 = "Steve_Jobs"
f1 = "my_copy_of-" + r1 + ".html"
f2 = "my_copy_of-" + r2 + ".html"
# read the first url
remote_full_url = remote_base_url + "/" + r1
rpage = open(remote_full_url).read
# write the first file to disk
file = open(f1, "w")
file.write(rpage)
file.close
# read the second url
remote_full_url = remote_base_url + "/" + r2
rpage = open(remote_full_url).read
# write the second file to disk
file = open(f2, "w")
file.write(rpage)
file.close
# open a new file:
compiled_file = open("apple-guys.html", "w")
# reopen the first and second files again
k1 = open(f1, "r")
k2 = open(f2, "r")
compiled_file.write(k1.read)
compiled_file.write(k2.read)
k1.close
k2.close
compiled_file.close
The code fails with the following trace:
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:277:in `open_http': 403 Forbidden (OpenURI::HTTPError)
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:518:in `open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:30:in `open'
from /Users/arkidmitra/tweetfetch/samecode.rb:11
My problem is not that the code fails but that whenever I change r2 to anything other than Steve_Jobs, it works. What is happening here?
Your code runs fine for me (Ruby MRI 1.9.3) when I request a wiki page that exists.
When I request a wiki page that does NOT exist, I get a mediawiki 404 error code.
Steve_Jobs => success
Steve_Austin => success
Steve_Rogers => success
Steve_Foo => error
Wikipedia does a ton of caching, so if you see responses for "Steve_Jobs" that differ from those for other people who do exist, my best guess is that Wikipedia is caching the Steve Jobs article because he's famous, and potentially adding extra checks/verifications to protect the article from rapid changes, defacing, etc.
The solution for you: always open the url with a User Agent string.
rpage = open(remote_full_url, "User-Agent" => "Whatever you want here").read
Details from the Mediawiki docs: "When you make HTTP requests to the MediaWiki web service API, be sure to specify a User-Agent header that properly identifies your client. Don't use the default User-Agent provided by your client library, but make up a custom header that includes the name and the version number of your client: something like "MyCuteBot/0.1".
On Wikimedia wikis, if you don't supply a User-Agent header, or you supply an empty or generic one, your request will fail with an HTTP 403 error. See our User-Agent policy."
I think this happens for locked-down entries like "Steve Jobs", "Al Gore", etc. This is specified in the same book that you are referring to:
For some pages – such as Al Gore's locked-down entry – Wikipedia will not respond to a web request if a User-Agent isn't specified. The "User-Agent" typically refers to your browser, and you can see this by inspecting the headers you send for any page request in your browser. By providing a "User-Agent" key-value pair (I basically use "Ruby" and it seems to work), we can pass it as a hash (I use the constant HEADERS_HASH in the example) as the second argument of the method call.
This is covered later in the book, at http://ruby.bastardsbook.com/chapters/web-crawling/
