Ruby and Net::HTTP request without Content-Type

I'm trying to make a call to a Tika server using Net::HTTP::Put. The issue is that the call always sends a Content-Type header, which keeps Tika from running its detectors (which I want); Tika then chokes on the default Content-Type of application/x-www-form-urlencoded, a type the Tika docs suggest not using.
So, I have the following:
require 'net/http'
port = 9998
host = "localhost"
path = "/meta"
req = Net::HTTP::Put.new(path)
req.body_stream = File.open(file_name)
req['Transfer-Encoding'] = 'chunked'
req['Accept'] = 'application/json'
response = Net::HTTP.new(host, port).start { |http|
http.request(req)
}
I tried adding req.delete('content-type') and setting initheaders = {} in various ways, but the default content-type keeps getting sent.
Any insights would be greatly appreciated, since I would rather avoid having to make multiple curl calls ... is there any way to suppress the sending of that default header?

If you set req['Content-Type'] = nil, Net::HTTP will set it to the default of 'application/x-www-form-urlencoded', but if you set it to a blank string, Net::HTTP leaves it alone:
req['Content-Type'] = ''
Tika should see that as an invalid type and enable the detectors.
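For instance, a minimal end-to-end sketch based on the question's code (file_name is assumed to point at whatever local file you want Tika to inspect):
require 'net/http'
port = 9998
host = "localhost"
path = "/meta"
file_name = "sample.pdf" # assumed: any local file for Tika to analyze
req = Net::HTTP::Put.new(path)
req.body_stream = File.open(file_name)
req['Transfer-Encoding'] = 'chunked'
req['Accept'] = 'application/json'
req['Content-Type'] = '' # blank string, so Net::HTTP won't substitute the form default
response = Net::HTTP.new(host, port).start { |http| http.request(req) }
puts response.body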

It seems that Tika will run the detectors if the Content-Type is application/octet-stream. Adding
req.content_type = "application/octet-stream"
is now allowing me to get results.

Related

Proper way to upload a doc to FSCrawler for indexing in Elasticsearch

I'm prototyping a Rails application to upload documents to FSCrawler (running the REST interface), to incorporate into an Elasticsearch index. Using their example, this works:
response = `curl -F "file=@#{params[:document][:upload].tempfile.path}" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"`
The file gets uploaded, and the content gets indexed. This is an example of what I get:
"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...
When I run curl at the command line, I get EVERYTHING, like the "filename" being properly set. If I use it as above, in the Rails controller, as you can see, the filename is set to the Tempfile's filename. That's not a workable solution. Trying to use params[:document][:upload].tempfile (without .path) or just params[:document][:upload] both fail entirely.
I'm trying to do this "the right way," but every incarnation of using a proper HTTP client to do this fails. I can't figure out how to invoke an HTTP POST that will submit a file to FSCrawler the way curl (on the command line) does it.
In this example, I'm just trying to send the file by using the Tempfile object. For some reason, FSCrawler gives me the error in the comment, and I get a little metadata, but no content is indexed:
## Failed to extract [100000] characters of text for ...
## org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
uri = URI("http://127.0.0.1:8080/fscrawler/_upload?debug=true")
request = Net::HTTP::Post.new(uri)
form_data = [['file', params[:document][:upload].tempfile,
{ filename: params[:document][:upload].original_filename,
content_type: params[:document][:upload].content_type }]]
request.set_form form_data, 'multipart/form-data'
response = Net::HTTP.start(uri.hostname, uri.port) do |http|
http.request(request)
end
If I change the above to use params[:document][:upload].tempfile.path, then I don't get the error about the InputStream, but I also (still) do not get any content indexed. This is an example of what I get:
{"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}
If I try to use RestClient, and I try to send the file by referencing the actual path to the Tempfile, then I get this error message, and I get nothing:
## Unsupported media type
response = RestClient.post 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
file: params[:document][:upload].tempfile.path,
content_type: params[:document][:upload].content_type
If I try to .read() the file, and submit that, then I break the FSCrawler form:
## Internal server error
request = RestClient::Request.new(
:method => :post,
:url => 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
:payload => {
:multipart => true,
:file => File.read(params[:document][:upload].tempfile),
:content_type => params[:document][:upload].content_type
})
response = request.execute
Obviously, I've been trying this every way I can, but I can't replicate whatever curl is doing with any known Ruby-based HTTP clients. I'm utterly lost as to how to get Ruby to submit data to FSCrawler in a way that will get the document contents indexed properly. I've been at this far longer than I care to admit. What am I missing here?
I finally tried Faraday, and, based on this answer, came up with the following:
connection = Faraday.new('http://127.0.0.1:8080') do |f|
f.request :multipart
f.request :url_encoded
f.adapter :net_http
end
file = Faraday::UploadIO.new(
params[:document][:upload].tempfile.path,
params[:document][:upload].content_type,
params[:document][:upload].original_filename
)
payload = { :file => file }
response = connection.post('/fscrawler/_upload', payload)
Using Fiddler helped me to see the results of my attempts, as I got closer and closer to the curl request. This snippet posts the request almost exactly as curl does. To route this call through the proxy, I just needed to add , proxy: 'http://localhost:8866' to the end of the connection setup.
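In other words, the proxied version of the connection setup looks like this:
connection = Faraday.new('http://127.0.0.1:8080', proxy: 'http://localhost:8866') do |f|
f.request :multipart
f.request :url_encoded
f.adapter :net_http
end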

Getting assignees for a set of tasks

I'm developing a tool to assign a bunch of tasks to a guy according to some criteria.
I fetch tasks for a given tag.
I only assign a task to a guy if the task has no assignee.
My problem comes with the last statement. Fetching a list of tasks does not provide enough information. Going through the documentation, I remember I can format the response with the fields I need using opt_fields, but I haven't succeeded in implementing it.
I have this piece of code:
# set up HTTPS connection
uri = URI.parse("https://app.asana.com/api/1.0/tags/8232053093879/tasks?opt_fields=name,assignee")
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
# set up the request
header = { "Content-Type" => "application/json" }
request = Net::HTTP::Get.new(uri.path, header)
request.basic_auth(AppConfig[:api_key], '')
# issue the request
response = http.start { |h| h.request(request) }
# output
body = JSON.parse(response.body)
And it keeps responding with:
{"id"=>8232053093904, "name"=>"Implement open VPN"}
{"id"=>8232053093899, "name"=>"Implement a #emi tool for random task affectation."}
{"id"=>8232053093893, "name"=>"List possibilities for internal server hosting ?"}
{"id"=>8232053093890, "name"=>"Create a server FAQ (how to access, how to restart an app, how to set up a new server)"}
{"id"=>8232053093883, "name"=>"Help alban debug munin configuration (server monitoring tool)"}
{"id"=>8232053093876, "name"=>"Think how to improve nanoc deployment"}
While, using curl:
curl -u 8NYknPS.aMxj55LsWwwujpZgNqQ078xf: "https://app.asana.com/api/1.0/tags/8232053093879/tasks?opt_fields=name,assignee"
I get:
{"data":[
{"id":8232053093904,"name":"Implement open VPN","assignee":null},
{"id":8232053093899,"name":"Implement a #emi tool for random task affectation.","assignee":null},
{"id":8232053093893,"name":"List possibilities for internal server hosting ?","assignee":{"id":1069528343983}},
{"id":8232053093890,"name":"Create a server FAQ (how to access, how to restart an app, how to set up a new server)","assignee":null},
{"id":8232053093883,"name":"Help alban debug munin configuration (server monitoring tool)","assignee":{"id":1069528343983}},
{"id":8232053093876,"name":"Think how to improve nanoc deployment","assignee":{"id":753180655981}}
]}
What am I missing?
You may need to give the nested fields, like ?opt_fields=name,assignee,assignee.id - it's a bit clunky, unfortunately. But if you just want the whole assignee, you can use ?opt_expand=assignee.
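One thing worth checking against the question's code: Net::HTTP::Get.new(uri.path, header) drops the query string, so the opt_fields parameter never reaches Asana at all; uri.request_uri keeps it. A minimal sketch with both adjustments (only these lines change):
uri = URI.parse("https://app.asana.com/api/1.0/tags/8232053093879/tasks?opt_fields=name,assignee,assignee.id")
# or, to pull in the whole assignee object instead:
# uri = URI.parse("https://app.asana.com/api/1.0/tags/8232053093879/tasks?opt_expand=assignee")
request = Net::HTTP::Get.new(uri.request_uri, header) # request_uri = path plus query string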
Hope that helps!

Connect to Microsoft Push Notification Service for Windows Phone 8 from Ruby

We are developing a WP8 app that requires push notifications.
To test it we have run the push notification POST request with curl on the command line, making sure that it actually connects, authenticates with the client SSL certificate, and sends the correct data. We know for a fact that this works, as we are receiving pushes on the devices.
This is the curl command we have been using for testing purposes:
curl --cert client_cert.pem -v -H "Content-Type:text/xml" -H "X-WindowsPhone-Target:Toast" -H "X-NotificationClass:2" -X POST -d "<?xml version='1.0' encoding='utf-8'?><wp:Notification xmlns:wp='WPNotification'><wp:Toast><wp:Text1>My title</wp:Text1><wp:Text2>My subtitle</wp:Text2></wp:Toast></wp:Notification>" https://db3.notify.live.net/unthrottledthirdparty/01.00/AAF9MBULkDV0Tpyj24I3bzE3AgAAAAADCQAAAAQUZm52OkE1OUZCRDkzM0MyREY1RkE
Of course our SSL cert is needed to actually use the URL, but I was hoping someone else has done this and can see what we are doing wrong.
Now, our problem is that we need to make this work with Ruby instead, something we have been unable to get to work so far.
We have tried using HTTParty with no luck, and also net/http directly without any luck.
Here is a very simple HTTParty test script I have used to test with:
require "httparty"
payload = "<?xml version='1.0' encoding='utf-8'?><wp:Notification xmlns:wp='WPNotification'><wp:Toast><wp:Text1>My title</wp:Text1><wp:Text2>My subtitle</wp:Text2></wp:Toast></wp:Notification>"
uri = "https://db3.notify.live.net/unthrottledthirdparty/01.00/AAF9MBULkDV0Tpyj24I3bzE3AgAAAAADCQAAAAQUZm52OkE1OUZCRDkzM0MyREY1RkE"
opts = {
body: payload,
headers: {
"Content-Type" => "text/xml",
"X-WindowsPhone-Target" => "Toast",
"X-NotificationClass" => "2"
},
debug_output: $stderr,
pem: File.read("/Users/kenny/Desktop/client_cert.pem"),
ca_file: File.read('/usr/local/opt/curl-ca-bundle/share/ca-bundle.crt')
}
resp = HTTParty.post uri, opts
puts resp.code
This seems to connect with SSL properly, but then the MS IIS server returns 403 to us for some reason we don't get.
Here is essentially the same thing I've tried using net/http:
require "net/http"
url = URI.parse "https://db3.notify.live.net/unthrottledthirdparty/01.00/AAF9MBULkDV0Tpyj24I3bzE3AgAAAAADCQAAAAQUZm52OkE1OUZCRDkzM0MyREY1RkE"
payload = "<?xml version='1.0' encoding='utf-8'?><wp:Notification xmlns:wp='WPNotification'><wp:Toast><wp:Text1>My title</wp:Text1><wp:Text2>My subtitle</wp:Text2></wp:Toast></wp:Notification>"
pem_path = "./client_cert.pem"
cert = File.read pem_path
http = Net::HTTP.new url.host, url.port
http.use_ssl = true
http.cert = OpenSSL::X509::Certificate.new cert
http.key = OpenSSL::PKey::RSA.new cert
http.ca_path = '/etc/ssl/certs' if File.exists?('/etc/ssl/certs') # Ubuntu
http.ca_file = '/usr/local/opt/curl-ca-bundle/share/ca-bundle.crt' if File.exists?('/usr/local/opt/curl-ca-bundle/share/ca-bundle.crt') # Mac OS X
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
r = Net::HTTP::Post.new url.path
r.body = payload
r.content_type = "text/xml"
r["X-WindowsPhone-Target"] = "toast"
r["X-NotificationClass"] = "2"
http.start do
resp = http.request r
puts resp.code, resp.body
end
Like the HTTParty version, this also returns 403.
I'm starting to get the feeling that this won't actually work with net/http, but I've also seen a few examples of code claiming to work, but I can't see any difference compared to what we have tested with here.
Does anyone know how to fix this? Is it possible? Should I use libcurl instead perhaps? Or even do a system call to curl? (I may have to do the last one as an interim solution if we can't get this to work soon).
Any input is greatly appreciated!
Thanks,
Kenny
Try using a tool like http://mitmproxy.org to compare the requests made by your code and by curl.
For example, in addition to the headers you specify, curl also sends User-Agent and Accept headers; the Microsoft servers may be checking for these for some reason.
If that does not help, then the problem is SSL-related.
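If you want to test the header theory quickly with the net/http version, you could set something like this on the request before http.start (both values are assumptions modeled on curl's defaults, not documented requirements of the Microsoft endpoint):
r["User-Agent"] = "curl/7.30.0" # assumed: whatever your curl build actually reports
r["Accept"] = "*/*" # curl sends Accept: */* by default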

Making a URL in a string usable by Ruby's Net::HTTP

Ruby's Net::HTTP needs to be given a full URL in order for it to connect to the server and get the file properly. By "full URL" I mean a URL including the http:// part and the trailing slash if it needs it. For instance, Net::HTTP won't connect to a URL looking like this: example.com, but will connect just fine to http://example.com/. Is there any way to make sure a URL is a full URL, and add the required parts if it isn't?
EDIT: Here is the code I am using:
parsed_url = URI.parse(url)
req = Net::HTTP::Get.new(parsed_url.path)
res = Net::HTTP.start(parsed_url.host, parsed_url.port) {|http|
http.request(req)
}
If this is only doing what the sample code shows, Open-URI would be an easier approach.
require 'open-uri'
res = open(url).read
This would do a simple check for http/https:
if !(url =~ /^https?:/i)
url = "http://" + url
end
This is a more general one that handles multiple protocols (ftp, etc.):
if !(url =~ /^\w+:/i)
url = "http://" + url
end
In order to make sure parsed_url.path gives you a proper value (it should be / when no specific path was provided), you could do something like this:
req = Net::HTTP::Get.new(parsed_url.path.empty? ? '/' : parsed_url.path)
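Putting the scheme check and the path fallback together, a small helper might look like this (fetch_url is a hypothetical name, not part of Net::HTTP):
require 'net/http'
require 'uri'
def fetch_url(url)
# prepend a default scheme if the string doesn't have one
url = "http://" + url if !(url =~ /^\w+:/i)
parsed_url = URI.parse(url)
# URI.parse gives an empty path for "http://example.com", so fall back to "/"
path = parsed_url.path.empty? ? '/' : parsed_url.path
req = Net::HTTP::Get.new(path)
Net::HTTP.start(parsed_url.host, parsed_url.port) { |http| http.request(req) }
end
res = fetch_url("example.com")
puts res.code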

How to pass cookies from one page to another using curl in Ruby?

I am writing a video crawler in Ruby. It has to log in to a page (with cookies enabled) and then download pages behind that login. For this I am using the Curb (libcurl) library. I can successfully log in, but I can't download the pages behind the login. How can I fix this, or download the pages some other way?
My code is
curl = Curl::Easy.new(1st url)
curl.follow_location = true
curl.enable_cookies = true
curl.cookiefile = "cookie.txt"
curl.cookiejar = "cookie.txt"
curl.http_post(1st url,field)
curl.perform
curl = Curl::Easy.perform(2nd url)
curl.follow_location = true
curl.enable_cookies = true
curl.cookiefile = "cookie.txt"
curl.cookiejar = "cookie.txt"
curl.http_get
code = curl.body_str
What I've seen in writing my own similar "post-then-get" script is that ruby/Curb (I'm using version 0.7.15 with Ruby 1.8) seems to ignore the cookiejar/cookiefile fields of a Curl::Easy object. If I set either of those fields and the http_post completes successfully, no cookiejar or cookiefile file is created. Also, curl.cookies will still be nil after your curl.http_post; however, the cookies ARE set within the curl object. I promise :)
I think where you're going wrong is here:
curl = Curl::Easy.perform(2nd url)
The curb documentation states that this creates a new object. That new object doesn't have any of your existing cookies set. If you change your code to look like the following, I believe it should work. I've also removed the curl.perform for the first url since curl.http_post already implicitly does the "perform". You were basically http_post'ing twice before trying your http_get.
curl = Curl::Easy.new(1st url)
curl.follow_location = true
curl.enable_cookies = true
curl.http_post(1st url,field)
curl.url = 2nd url
curl.http_get
code = curl.body_str
If this still doesn't seem to be working for you, you can verify if the cookie is getting set by adding
curl.verbose = true
before
curl.http_post
Your Curl::Easy object will dump all the headers that it gets in the response from the server to $stdout, and somewhere in there you should see a line stating that it added/set a cookie. I don't have any example output right now but I'll try to post a follow-up soon.
HTTPClient automatically enables cookies, as does Mechanize.
From the HTTPClient docs:
clnt = HTTPClient.new
clnt.get_content(url1) # receives Cookies.
clnt.get_content(url2) # sends Cookies if needed.
Posting a form is easy too:
body = { 'keyword' => 'ruby', 'lang' => 'en' }
res = clnt.post(uri, body)
Mechanize makes this sort of thing really simple (it will handle storing the cookies, among other things).
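A minimal Mechanize sketch of the same post-then-get flow (the URLs and form field names below are placeholders standing in for the question's 1st and 2nd URLs):
require 'mechanize'
url1 = "http://example.com/login" # placeholder for the 1st url
url2 = "http://example.com/videos" # placeholder for the 2nd url
agent = Mechanize.new
agent.post(url1, 'user' => 'name', 'pass' => 'secret') # assumed form fields; cookies land in agent.cookie_jar
page = agent.get(url2) # stored cookies are sent along automatically
code = page.body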
