Proper way to upload a doc to FSCrawler for indexing in Elasticsearch - ruby

I'm prototyping a Rails application to upload documents to FSCrawler (running the REST interface), to incorporate into an Elasticsearch index. Using their example, this works:
response = `curl -F "file=@#{params[:document][:upload].tempfile.path}" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"`
The file gets uploaded, and the content gets indexed. This is an example of what I get:
"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...
When I run curl at the command line, I get EVERYTHING, like the "filename" being properly set. If I use it as above, in the Rails controller, as you can see, the filename is set to the Tempfile's filename. That's not a workable solution. Trying to use params[:document][:upload].tempfile (without .path) or just params[:document][:upload] both fail entirely.
I'm trying to do this "the right way," but every incarnation of using a proper HTTP client to do this fails. I can't figure out how to invoke an HTTP POST that will submit a file to FSCrawler the way curl (on the command line) does it.
In this example, I'm just trying to send the file by using the Tempfile object. For some reason, FSCrawler gives me the error in the comment; I get a little metadata, but no content is indexed:
## Failed to extract [100000] characters of text for ...
## org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
uri = URI("http://127.0.0.1:8080/fscrawler/_upload?debug=true")
request = Net::HTTP::Post.new(uri)
form_data = [['file', params[:document][:upload].tempfile,
              { filename: params[:document][:upload].original_filename,
                content_type: params[:document][:upload].content_type }]]
request.set_form form_data, 'multipart/form-data'
response = Net::HTTP.start(uri.hostname, uri.port) do |http|
  http.request(request)
end
If I change the above to use params[:document][:upload].tempfile.path, then I don't get the error about the InputStream, but I also (still) do not get any content indexed. This is an example of what I get:
{"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}
If I try to use RestClient, and I try to send the file by referencing the actual path to the Tempfile, then I get this error message, and I get nothing:
## Unsupported media type
response = RestClient.post 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
                           file: params[:document][:upload].tempfile.path,
                           content_type: params[:document][:upload].content_type
If I try to .read() the file, and submit that, then I break the FSCrawler form:
## Internal server error
request = RestClient::Request.new(
  :method => :post,
  :url => 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  :payload => {
    :multipart => true,
    :file => File.read(params[:document][:upload].tempfile),
    :content_type => params[:document][:upload].content_type
  })
response = request.execute
Obviously, I've been trying this every way I can, but I can't replicate whatever curl is doing with any known Ruby-based HTTP clients. I'm utterly lost as to how to get Ruby to submit data to FSCrawler in a way that will get the document contents indexed properly. I've been at this far longer than I care to admit. What am I missing here?

I finally tried Faraday, and, based on this answer, came up with the following:
connection = Faraday.new('http://127.0.0.1:8080') do |f|
  f.request :multipart
  f.request :url_encoded
  f.adapter :net_http
end

file = Faraday::UploadIO.new(
  params[:document][:upload].tempfile.path,
  params[:document][:upload].content_type,
  params[:document][:upload].original_filename
)

payload = { :file => file }
response = connection.post('/fscrawler/_upload', payload)
Using Fiddler helped me to see the results of my attempts, as I got closer and closer to the curl request. This snippet posts the request almost exactly as curl does. To route this call through the proxy, I just needed to add , proxy: 'http://localhost:8866' to the end of the connection setup.
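Note for newer Faraday versions: the multipart middleware now lives in the separate faraday-multipart gem, and Faraday::UploadIO is deprecated in favor of Faraday::Multipart::FilePart. A rough sketch of the equivalent setup under that assumption (not part of the original solution):

require 'faraday'
require 'faraday/multipart'

connection = Faraday.new('http://127.0.0.1:8080') do |f|
  f.request :multipart       # encode the body as multipart/form-data
  f.request :url_encoded
  f.adapter Faraday.default_adapter
end

# Faraday::Multipart::FilePart plays the role of Faraday::UploadIO
file = Faraday::Multipart::FilePart.new(
  params[:document][:upload].tempfile.path,
  params[:document][:upload].content_type,
  params[:document][:upload].original_filename
)

response = connection.post('/fscrawler/_upload', file: file)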

Related

Troubles generating signature for Alibaba Cloud

I'm reading the HTTP API docs, but my requests fail with a bad-signature error. From the error message I can see that my string to sign is correct, but it looks like I can't generate the correct HMAC-SHA1 (seriously, why still use SHA1?).
So I decided to try to replicate the signature of the sample in the same document.
[47] pry(main)> to_sign = "GET&%2F&AccessKeyId%3Dtestid&Action%3DDescribeRegions&Format%3DXML&SignatureMethod%3DHMAC-SHA1&SignatureNonce%3D3ee8c1b8-83d3-44af-a94f-4e0ad82fd6cf&SignatureVersion%3D1.0&Timestamp%3D2016-02-23T12%253A46%253A24Z&Version%3D2014-05-26"
[48] pry(main)> Base64.encode64 OpenSSL::HMAC.digest("sha1", "testsecret", to_sign)
=> "MLAxpXej4jJ7TL0smgWpOgynR7s=\n"
[49] pry(main)> Base64.encode64 OpenSSL::HMAC.digest("sha1", "testsecret&", to_sign)
=> "VyBL52idtt+oImX0NZC+2ngk15Q=\n"
[50] pry(main)> Base64.encode64 OpenSSL::HMAC.hexdigest("sha1", "testsecret&", to_sign)
=> "NTcyMDRiZTc2ODlkYjZkZmE4MjI2NWY0MzU5MGJlZGE3ODI0ZDc5NA==\n"
[51] pry(main)> Base64.encode64 OpenSSL::HMAC.hexdigest("sha1", "testsecret", to_sign)
=> "MzBiMDMxYTU3N2EzZTIzMjdiNGNiZDJjOWEwNWE5M2EwY2E3NDdiYg==\n"
[52] pry(main)> OpenSSL::HMAC.hexdigest("sha1", "testsecret&", to_sign)
=> "57204be7689db6dfa82265f43590beda7824d794"
[53] pry(main)> OpenSSL::HMAC.hexdigest("sha1", "testsecret", to_sign)
=> "30b031a577a3e2327b4cbd2c9a05a93a0ca747bb"
As is evident, none of these matches the example signature of CT9X0VtwR86fNWSnsc6v8YGOjuE=. Any idea what is missing here?
Update: taking a tcpdump of the Golang client tool, I see that it does a POST request like:
POST /?AccessKeyId=**********&Action=DescribeRegions&Format=JSON&RegionId=cn-qingdao&Signature=aHZVpIMb0%2BFKdoWSIVaFJ7bd2LA%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=c29a0e28964c470a8997aebca4848b57&SignatureType=&SignatureVersion=1.0&Timestamp=2018-07-16T19%3A46%3A33Z&Version=2014-05-26 HTTP/1.1
Host: ecs.aliyuncs.com
User-Agent: Aliyun-CLI-V3.0.3
Content-Length: 0
Content-Type: application/x-www-form-urlencoded
x-sdk-client: golang/1.0.0
x-sdk-core-version: 0.0.1
x-sdk-invoke-type: common
Accept-Encoding: gzip
When I take the parameters from the above request and generate the signature, it does match. So I tried all three: GET, POST with URL params, and POST with params in the body. Every time I get a signature error. If I redo the request with the exact same params as the golang tool, I get a "nonce already used" error (as expected).
Finally got this working. The main issue in my case was that I had been double-percent-encoding the signature parameter, so it turned out invalid. What helped me most was running the aliyun cli utility and capturing traffic, then running a query with exactly the same parameters to compare the exact query string.
Let me list some key points:
once the HMAC-SHA1 signature is generated, do not percent-encode it; just add it to the query with normal www-form encoding
the order of parameters in the HTTP query is not significant; the order of parameters in the signing string is significant, though
I found all of the following types of requests to work: GET, POST with parameters in the URL query, and POST with www-form-encoded parameters in the request body; I'm using GET per the documentation, but I see aliyun using POST with query params and ordered params in the query
you must add a & character to the end of the secret key when generating the HMAC-SHA1
generate the HMAC-SHA1 in binary form, then encode it as Base64 (no hex values)
some parameters might be case insensitive, e.g. Format works both as json and JSON
I see aliyun, @wanghq and John using a UUID 4 for SignatureNonce, but I deferred to plain random (per the docs) because it seems to be only a replay-attack protection, so a cryptographically secure random number should be unnecessary
the special encoding rules for +, * and ~ seem to apply only to the string for signing, not to how data is actually encoded in the HTTP query
I decided not to use @wanghq's wrapper, as it didn't work for me and also disables certificate validation (though maybe that will be fixed). I just thought queries are simple enough once the signature is figured out, and an additional layer of indirection isn't worth it. +1 to his answer though, as it was helpful in getting my signature right.
Here's example Ruby code to make a simple request:
require 'base64'
require 'cgi'
require 'openssl'
require 'time'
require 'rest-client'

# perform a request against the Alibaba Cloud API
# @see https://www.alibabacloud.com/help/doc-detail/25489.htm
def request(action:, params: {})
  api_url = "https://ecs.aliyuncs.com/"
  # method = "POST"
  method = "GET"
  process_params!(http: method, action: action, params: params)
  RestClient::Request.new(method: method, url: api_url, headers: {params: params}).execute
  # RestClient::Request.new(method: method, url: api_url, payload: params).execute
  # RestClient::Request.new(method: method, url: api_url, payload: params.map{|k,v| "#{k}=#{CGI.escape(v)}"}.join("&")).execute
end

# generates the required common params for a request and adds them to params
# @return undefined
# @see https://www.alibabacloud.com/help/doc-detail/25490.htm
def process_params!(http:, action:, params:)
  params.merge!({
    "Action" => action,
    "AccessKeyId" => config[:auth][:key_id],
    "Format" => "JSON",
    "Version" => "2014-05-26",
    "Timestamp" => Time.now.utc.iso8601
  })
  sign!(http: http, action: action, params: params)
end

# generates the request signature and adds it to params
# @return undefined
# @see https://www.alibabacloud.com/help/doc-detail/25492.htm
def sign!(http:, action:, params:)
  params.delete "Signature"
  params["SignatureMethod"] = "HMAC-SHA1"
  params["SignatureVersion"] = "1.0"
  params["SignatureNonce"] = "#{rand(1_000_000_000_000)}"
  # params["SignatureNonce"] = SecureRandom.uuid.gsub("-", "")
  canonicalized_query_string = params.sort.map { |key, value|
    "#{key}=#{percent_encode value}"
  }.join("&")
  string_to_sign = %{#{http}&#{percent_encode("/")}&#{percent_encode(canonicalized_query_string)}}
  params["Signature"] = hmac_sha1(string_to_sign)
end

# @param data [String]
# @return [String]
def hmac_sha1(data, secret: config[:auth][:key_secret])
  Base64.encode64(OpenSSL::HMAC.digest('sha1', "#{secret}&", data)).strip
end

# encode strings per Alibaba Cloud rules for signing
# @return [String] encoded string
# @see https://www.alibabacloud.com/help/doc-detail/25492.htm
def percent_encode(str)
  CGI.escape(str).gsub(?+, "%20").gsub(?*, "%2A").gsub("%7E", ?~)
end

## example call
request(action: "DescribeRegions")
The code can be simplified a little, but I decided to keep it very close to the documentation instructions. The config helper it relies on is sketched below.
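The snippet assumes a config method that returns the API credentials; it isn't shown in the original answer. A minimal hypothetical sketch, with environment-variable names chosen purely for illustration:

# Hypothetical credentials helper assumed by the code above; the env var
# names are placeholders, not part of the original answer.
def config
  {
    auth: {
      key_id: ENV.fetch("ALIYUN_ACCESS_KEY_ID"),
      key_secret: ENV.fetch("ALIYUN_ACCESS_KEY_SECRET")
    }
  }
end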
P.S. Not sure why John deleted his answer, but I'm leaving the link above to his web page for any Python folks looking for example code.
It seems this aliyun ruby sdk (unofficial, just for reference) works. You may want to check how it's implemented.
Check what its string_to_sign looks like. I did a run, and it seems slightly different from what you provided. The params are concatenated with & instead of %26.
GET&%2F&AccessKeyId%3Dtestid&Action%3DDescribeRegions&Format%3DXML&SignatureMethod%3DHMAC-SHA1&SignatureNonce%3D3ee8c1b8-83d3-44af-a94f-4e0ad82fd6cf&SignatureVersion%3D1.0&Timestamp%3D2016-02-23T12%253A46%253A24Z&Version%3D2014-05-26
require 'rubygems'
require 'aliyun'
$DEBUG = true
options = {
  :access_key_id => "k",
  :access_key_secret => "s",
  :service => :ecs
}
service = Aliyun::Service.new options
puts service.DescribeRegions({})
I wanted to share a library I found (Python) that does everything for me without the need to sign the request myself.
It can also help those who want to just copy its functions and still construct the signature on their own.
I'm using this:
from aliyunsdkcore.client import AcsClient
from aliyunsdkvpc.request.v20160428.DescribeEipAddressesRequest import DescribeEipAddressesRequest
client = AcsClient(access_key, secret_key, region)
request = DescribeEipAddressesRequest()
request.set_accept_format('json')
response = client.do_action_with_exception(request) # FYI returned as Bytes
print(response)
Each section in Alibaba Cloud has its own library (just like I used: aliyunsdkvpc for EIP addresses)
And they are all listed here:
https://develop.aliyun.com/tools/sdk?#/python

Ruby hipchat gem invalid send file

So this is related to an earlier post I made on this method. This is essentially what I am using to send files via hipchat:
#!/usr/bin/env ruby
require 'hipchat'
client = HipChat::Client.new('HIPCHAT_TOKEN', :api_version => 'v2', :server_url => 'HIPCHAT_URL')
client.user('some_username').send_file('message', File.open('./output/some-file.csv') )
client['some_hipchat_room'].send_file('some_user', 'message', File.open('./output/some-file.csv') )
Now for some reason the send_file method is invalid:
/path/to/gems/hipchat-1.5.4/lib/hipchat/errors.rb:40:in `response_code_to_exception_for': You requested an invalid method. path:https://hipchat.illum.io/v2/user/myuser#myemail/share/file?auth_token=asdfgibberishasdf method:Net::HTTP::Get (HipChat::MethodNotAllowed)
from /path/to/gems/gems/hipchat-1.5.4/lib/hipchat/user.rb:50:in `send_file'
I think this indicates that you should be using POST instead of GET, but I'm not sure because I haven't used this library or HipChat.
Looking at the question you referenced and the source posted by another user, they're sending the request using self.class.post, while your debug output shows Net::HTTP::Get.
To debug, could you try:
file = Tempfile.new('foo').tap do |f|
  f.write("the content")
  f.rewind
end

user = client.user(some_username)
user.send_file('some bytes', file)
The issue is that I was attempting to connect to the server via http instead of https. If the following client is causing issues:
client = HipChat::Client.new('HIPCHAT_TOKEN', :api_version => 'v2', :server_url => 'my.company.com')
Then try adding https:// to the beginning of the server URL:
client = HipChat::Client.new('HIPCHAT_TOKEN', :api_version => 'v2', :server_url => 'https://my.company.com')

Ruby's Net::HTTP doesn't get the right answer from the Google OAuth server

I'm writing a small CLI tool that should check my calendar and do some stuff according to my appointments.
I'm struggling a little bit with the OAuth2 authentication. I've checked the scope and the client_id with the curl tool like this:
curl -d "client_id=12345...&scope=scope=https://www.googleapis.com/auth/calendar.readonly" https://accounts.google.com/o/oauth2/device/code
This way, I get the right response.
{
  "device_code" : "somestuff",
  "user_code" : "otherstuff",
  "verification_url" : "http://www.google.com/device",
  "expires_in" : 1800,
  "interval" : 5
}
But when I try to use Net::HTTP in Ruby, I just get HTTP status 200. I've done it this way:
res = Net::HTTP.post_form(uri, {'client_id' =>'1234....apps.googleusercontent.com', 'scope' => 'https://www.googleapis.com/auth/calendar.readonly' })
If I check the res variable afterwards, I get status 302, but I guess this is correct.
Can someone tell me what I'm doing wrong, since I don't get the JSON response? Should I try something different from Net::HTTP?
res is a variable containing all the response data, not just the text of the response. If you puts res.body after your post_form() call, you should find your JSON (which you can parse with the JSON module).
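For illustration, a minimal sketch assuming the same post_form call as in the question (the client_id is a placeholder):

require 'net/http'
require 'json'

uri = URI('https://accounts.google.com/o/oauth2/device/code')
res = Net::HTTP.post_form(uri, 'client_id' => '1234....apps.googleusercontent.com',
                               'scope' => 'https://www.googleapis.com/auth/calendar.readonly')

puts res.code               # the HTTP status, e.g. "200"
data = JSON.parse(res.body) # the JSON document curl was showing
puts data['user_code']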

How to pass cookies from one page to another using curl in Ruby?

I am writing a video crawler in Ruby. In it, I have to log in to a page (with cookies enabled) and then download pages. For that I am using the Curl (Curb) library in Ruby. I can successfully log in, but I can't download the pages behind the login with curl. How can I fix this, or download the pages some other way?
My code is:
curl = Curl::Easy.new(1st url)
curl.follow_location = true
curl.enable_cookies = true
curl.cookiefile = "cookie.txt"
curl.cookiejar = "cookie.txt"
curl.http_post(1st url,field)
curl.perform
curl = Curl::Easy.perform(2nd url)
curl.follow_location = true
curl.enable_cookies = true
curl.cookiefile = "cookie.txt"
curl.cookiejar = "cookie.txt"
curl.http_get
code = curl.body_str
What I've seen in writing my own similar "post-then-get" script is that ruby/Curb (I'm using version 0.7.15 with ruby 1.8) seems to ignore the cookiejar/cookiefile fields of a Curl::Easy object. If I set either of those fields and the http_post completes successfully, no cookiejar or cookiefile file is created. Also, curl.cookies will still be nil after your curl.http_post, however, the cookies ARE set within the curl object. I promise :)
I think where you're going wrong is here:
curl = Curl::Easy.perform(2nd url)
The curb documentation states that this creates a new object. That new object doesn't have any of your existing cookies set. If you change your code to look like the following, I believe it should work. I've also removed the curl.perform for the first url since curl.http_post already implicitly does the "perform". You were basically http_post'ing twice before trying your http_get.
curl = Curl::Easy.new(1st url)
curl.follow_location = true
curl.enable_cookies = true
curl.http_post(1st url,field)
curl.url = 2nd url
curl.http_get
code = curl.body_str
If this still doesn't seem to be working for you, you can verify if the cookie is getting set by adding
curl.verbose = true
Before
curl.http_post
Your Curl::Easy object will dump all the headers that it gets in the response from the server to $stdout, and somewhere in there you should see a line stating that it added/set a cookie. I don't have any example output right now but I'll try to post a follow-up soon.
HTTPClient automatically enables cookies, as does Mechanize.
From the HTTPClient docs:
clnt = HTTPClient.new
clnt.get_content(url1) # receives Cookies.
clnt.get_content(url2) # sends Cookies if needed.
Posting a form is easy too:
body = { 'keyword' => 'ruby', 'lang' => 'en' }
res = clnt.post(uri, body)
Mechanize makes this sort of thing really simple (It will handle storing the cookies, among other things).
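A minimal sketch of the Mechanize approach, with placeholder URLs and form fields; the agent keeps its cookie jar between requests automatically:

require 'mechanize'

agent = Mechanize.new
# the login POST stores any Set-Cookie headers in the agent's cookie jar
agent.post('http://example.com/login', 'username' => 'name', 'password' => 'secret')
# subsequent requests send those cookies back automatically
page = agent.get('http://example.com/videos/1')
puts page.body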

Multipart File Upload in Ruby

I simply want to upload an image to a server with POST. As simple as this task sounds, there seems to be no simple solution in Ruby.
In my application I am using WWW::Mechanize for most things so I wanted to use it for this too, and had a source like this:
f = File.new(filename, File::RDWR)
reply = agent.post(
  'http://rest-test.heroku.com',
  {
    :pict => f,
    :function => 'picture2',
    :username => @username,
    :password => @password,
    :pict_to => 0,
    :pict_type => 0
  }
)
f.close
This results in a totally garbage-ready file on the server that looks scrambled all over:
(screenshot of the scrambled result: http://imagehub.org/f/1tk8/garbage.png)
My next step was to downgrade WWW::Mechanize to version 0.8.5. This worked until I tried to run it, which failed with an error like "Module not found in hpricot_scan.so". Using the Dependency Walker tool I could find out that hpricot_scan.so needed msvcrt-ruby18.dll. Yet after I put that .dll into my Ruby/bin-folder it gave me an empty error box from where on I couldn't debug very much further. So the problem here is that Mechanize 0.8.5 has a dependency on Hpricot instead of Nokogiri (which works flawlessly).
The next idea was to use a different gem, so I tried Net::HTTP. After a short search I found out that there is no native support for multipart forms in Net::HTTP; instead you have to build a class that does the encoding for you. The most helpful one I could find was the Multipart class by Stanislav Vitvitskiy. This class looked good so far, but it does not do what I need, because I don't want to post only files; I also want to post normal data, and that is not possible with his class.
My last attempt was to use RestClient. This looked promising, as there have been examples on how to upload files. Yet I can't get it to post the form as multipart.
f = File.new(filename, File::RDWR)
reply = RestClient.post(
  'http://rest-test.heroku.com',
  :pict => f,
  :function => 'picture2',
  :username => @username,
  :password => @password,
  :pict_to => 0,
  :pict_type => 0
)
f.close
I am using http://rest-test.heroku.com which sends back the request to debug if it is sent correctly, and I always get this back:
POST http://rest-test.heroku.com/ with a 101 byte payload,
content type application/x-www-form-urlencoded
{
  "pict" => "#<File:0x30d30c4>",
  "username" => "s1kx",
  "pict_to" => "0",
  "function" => "picture2",
  "pict_type" => "0",
  "password" => "password"
}
This clearly shows that it does not use multipart/form-data as content-type but the standard application/x-www-form-urlencoded, although it definitely sees that pict is a file.
How can I upload a file in Ruby to a multipart form without implementing the whole encoding and data aligning myself?
Long problem, short answer: I was missing the binary mode for reading the image under Windows.
f = File.new(filename, File::RDWR)
had to be
f = File.new(filename, "rb")
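For reference, a sketch of the Mechanize attempt from the question with only that change applied; everything else mirrors the original code:

f = File.new(filename, 'rb')   # binary mode so the image bytes aren't mangled on Windows
reply = agent.post(
  'http://rest-test.heroku.com',
  {
    :pict => f,
    :function => 'picture2',
    :username => @username,
    :password => @password,
    :pict_to => 0,
    :pict_type => 0
  }
)
f.close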
Another method is to use Bash and Curl. I used this method when I wanted to test multiple file uploads.
bash_command = 'curl -v -F "file=@texas.png,texas_reversed.png" http://localhost:9292/fog_upload/upload'
command_result = `#{bash_command}` # the backticks are important
puts command_result
