How can I read this binary XML response with Ruby? - ruby

I'm trying to read a response from a marketplace web service. Every other response from this web service is returned in XML format. However, this particular call is requesting a file download. I'm unfamiliar with the way in which it's returned. After looking at the contents, there is XML as well as encoded binary data which is in there as some sort of attachment.
The request is a simple XML request, sent like this:
require 'net/http'

count = 0 # retry counter (must be initialised before the begin block)
begin
  response = Net::HTTP.start(url.host, url.port, :use_ssl => url.scheme == 'https') do |http|
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    http.request(request)
  end
rescue Errno::ECONNRESET => e
  count += 1
  retry unless count > 10
  puts "Tried 10 times and couldn't get #{url.host}: #{e}"
end
Here is what the response.body looks like:
--MIMEBoundaryurn_uuid_AF2837F4196B2631EC15070889135182607126
Content-Type: application/xop+xml; charset=utf-8; type="text/xml"
Content-Transfer-Encoding: binary
Content-ID: <0.urn:uuid:AF2837F4196B2631EC15070889135182607127>
<?xml version='1.0' encoding='UTF-8'?>
<downloadFileResponse xmlns="http://www.marketplace.com/marketplace/services">
<ack>Success</ack>
<version>1.1.0</version>
<timestamp>2017-10-04T03:48:33.518Z</timestamp>
<fileAttachment>
<Size>25895</Size>
<Data><xop:Include xmlns:xop="http://www.w3.org/2004/08/xop/include" href="cid:urn:uuid:E3A8215C82DBC51E6D1507090865513"/></Data>
</fileAttachment>
</downloadFileResponse>
--MIMEBoundaryurn_uuid_AF2837F4196B2631EC15070889135182607126
Content-Type: application/zip
Content-Transfer-Encoding: binary
Content-ID: <urn:uuid:E3A8215C82DBC51E6D1507090865513>
PK&úCKÃEK¬gdi∂6509805153_report.xmlUT hH‘YhH‘Yux00ÏùÎs‚8∂¿ø˜_°€üÓ≠=-â Ù6ÚÍ< ›”≥µïr# NåÕ⁄&è˘ÎØÑÅêàçÕ‡…¶+ùnåd=|Œ—œÁ˱ۜáæÓT:æ˜Îg¥?Âu¸Æ„]ˇ˙y]ïƒÁ~˘¥≥;tokvd◊:=€ªVM|/T!–˘ΩP'
º≤∫¥Àˆ¿ Àj˜x◊U’ÔÎT ã¬œ˙-flÌ6’¿¢/ü>Ìú]‘Td;n¯e¸Ò∞ˆ L%‰åqÀ*!é, òdB±≥=Ic™Û®ÇÛpÙ©ÁªÆ≥ŸA∏≥=˚≈8ŸûÑ—©›W_v˛Á_’Z•]˘W€$Ó‹˛˚fl_∆9û“å3€/Ûòbº)œ4…8KΩØõÚî>äÀË≈Ÿ˛Ít\ÿ›Í¯˝ß{ƒy∆7hÙt_=›ÄC$·Ä3åürƒâtgˆúASuúÅ£ªwnοlç_'èo—ä•"ÙîAÇ)–ÆÔ{˜ v£ÿuÔ∫”õL2Ãf«œæ√„Ô™NÙ¯ºb{∂\Ÿ”{MSLnfGÍ,h˛ù„ufÚ}ØÃˇ<Mú≥·áëÌV˝ÓL&åuKJÿBhöy&Ÿ∏䲖ãǵ<o=UpÊ˚8«#ÎEKwŒl˝Œ-∞Ë¥O›4õÓ”N√~ÏÎ~Ø∫ T∑ÌË€aàx ¡$mƒ 
...
--MIMEBoundaryurn_uuid_AF2837F4196B2631EC15070889135182607126--
I can see there is zip data here, but I've read every other response with Hash.from_xml, and that's obviously not going to work here.
Update
If I write the string to a file test.zip, I can unzip it at the Linux command line, and it creates a readable XML file after flashing this warning:
Archive: test.zip
warning [test.zip]: 822 extra bytes at beginning or within zipfile
(attempting to process anyway)
inflating: 6509805153_report.xml
Not sure what the extra bytes it's complaining about are.
Update 2
That's definitely the MIME header and the XML envelope. I can confirm that if I strip those characters out by hand, and the MIME footer, then the test file unzips without warning.
So this appears to be a MIME multipart message: an XML envelope followed by a zip file attachment.

This may be a bit more complicated than just using a library. The closest I could find was an implementation in the Savon SOAP library that handles XOP in multipart responses. You might be able to analyze the code there and come up with a solution that fits your need, or, if this is a SOAP service, leverage the savon gem.
https://github.com/savonrb/savon-multipart/blob/master/lib/savon/multipart/response.rb#L63-L80
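If pulling in a library is not an option, the parts can also be separated by hand. The following is only a minimal sketch, not the savon implementation: `split_multipart` is a hypothetical helper, and it assumes the raw body uses CRLF line endings with a blank line between each part's headers and its content (the blank lines are not visible in the pasted response above). It also assumes the boundary string never occurs inside the binary attachment.

```ruby
# Hypothetical helper (not part of Net::HTTP or savon): splits a multipart
# body into parts. Assumes CRLF line endings and a blank line between each
# part's headers and its content.
def split_multipart(body, boundary)
  # Work on raw bytes so the zip data is never treated as text.
  segments = body.b.split("--#{boundary}".b)
  segments.shift # discard anything before the first boundary (preamble)

  parts = []
  segments.each do |segment|
    break if segment.start_with?('--') # closing delimiter: "--boundary--"

    segment = segment.delete_prefix("\r\n").delete_suffix("\r\n")
    header_text, content = segment.split("\r\n\r\n", 2)
    headers = header_text.to_s.split("\r\n").each_with_object({}) do |line, acc|
      name, value = line.split(':', 2)
      acc[name.strip.downcase] = value.to_s.strip
    end
    parts << { headers: headers, body: content.to_s }
  end
  parts
end

# Usage sketch (assumes `response` is the Net::HTTP response from the question
# and that its Content-Type header carries the boundary parameter):
#
#   boundary = response['Content-Type'][/boundary="?([^";]+)/, 1]
#   parts    = split_multipart(response.body, boundary)
#   zip_part = parts.find { |p| p[:headers]['content-type'].to_s.start_with?('application/zip') }
#   File.binwrite('test.zip', zip_part[:body])
```

The XML part can then still go through Hash.from_xml, while the zip part is written out in binary mode and unpacked with whatever zip tool or gem is already in use.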

Related

POST a JSON and a CSV in a multipart/form-data request through a NiFi 1.15 InvokeHTTP processor

I'm working on a NiFi 1.15 flow where I have to send a request to a service that requires 2 pieces of form-data sent in a POST request as a multipart/form-data. The first part is a simple JSON object with a few parameters, while the second part is a CSV file. Here is an example of what I am trying to achieve.
Content-Type: multipart/form-data; boundary=1cf28ed799fe4325b8cd0637a67dc612
--1cf28ed799fe4325b8cd0637a67dc612
Content-Disposition: form-data; name="json"; filename="json"
{"Param1": "Value1","Param2": "Value2","Param3": true}
--1cf28ed799fe4325b8cd0637a67dc612
Content-Disposition: form-data; name="file"; filename="body.csv"
Field1,Field2,Field3,Field4,Field5
VALUE_FIELD_1,VALUE_FIELD_2,VALUE_FIELD_3,"Some value for field 4",VALUE_FIELD_5
--1cf28ed799fe4325b8cd0637a67dc612--
Another acceptable output would have the Content-Disposition lines empty.
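For reference, the wire format shown above can be sketched in Ruby (used here only to illustrate the byte layout; it does not involve NiFi). This is a minimal sketch under the standard multipart rules: every line ends in CRLF, a blank line separates each part's headers from its content, and the closing delimiter ends with two extra hyphens. `build_form_data` is a hypothetical helper name.

```ruby
# Hypothetical helper: builds a multipart/form-data body from
# [name, filename, content] triples using explicit CRLF line endings.
def build_form_data(parts, boundary)
  body = +''
  parts.each do |name, filename, content|
    body << "--#{boundary}\r\n"
    body << %(Content-Disposition: form-data; name="#{name}"; filename="#{filename}"\r\n)
    body << "\r\n" # blank line ends the part headers
    body << content << "\r\n"
  end
  body << "--#{boundary}--\r\n" # closing delimiter
end

json = '{"Param1": "Value1","Param2": "Value2","Param3": true}'
csv  = "Field1,Field2,Field3,Field4,Field5\n" \
       "VALUE_FIELD_1,VALUE_FIELD_2,VALUE_FIELD_3,\"Some value for field 4\",VALUE_FIELD_5"
body = build_form_data([['json', 'json', json], ['file', 'body.csv', csv]],
                       '1cf28ed799fe4325b8cd0637a67dc612')
puts body
```

Comparing the bytes of a body built this way with the content of the generated flow file is one way to check whether line endings or a missing blank line are involved.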
Due to a few restrictions in the environment I am working on, I am not allowed to use scripting processors, such as ExecuteGroovyScript as suggested in another SO question.
Instead, I started creating a GenerateFlowFile -> InvokeHTTP flow. The GenerateFlowFile would output to a flow file a text similar to the one mentioned above. Here is the screenshot of the GenerateFlowFile.
The connected InvokeHTTP processor is configured to use the POST HTTP Method and to send headers (the Authorization header in my case), and Send Message Body is set to true. It also takes the Content-Type from the mime.type attribute of the previously generated flow file through the ${mime.type} expression. You can see the details in the following screenshots.
Sadly, this does not work. The server responds with an "Unexpected end of MIME multipart stream. MIME multipart message is not complete." error.
After searching for a while on SO, I found another question describing what looks like a similar problem, but there they are getting a different error and are also posting parameters through a different method.
I am also aware of the blog post from Otto Fowler where he shows how InvokeHTTP supports POSTs with multipart/form-data. I did try this approach too, but did not manage to get it working. The service throws an error stating that NiFi does not send one of my post:form:parts.
Right now I am stuck and am not able to send that data. I did manage to write a simple Python script to test if the server is working properly and it is. For reference, here is the script:
import requests

server = 'https://targetserver.com'

# Authentication
result = requests.post(server + '/authentication',
                       {'grant_type': 'password',
                        'username': 'username',
                        'password': 'password'})
token = result.json()['access_token']

# Build the request
headers = {'Authorization': 'bearer ' + token}
json_data = '{"Param1": "Value1","Param2": "Value2","Param3": true}'
# First the JSON, then the CSV file.
files = {'json': json_data,
         'file': open('body.csv', 'rb')}
result = requests.post(server + '/endpoint', headers=headers, files=files)
print(result.text)
Does anyone have a suggestion on how to get around this situation?

What is the "accept" part for?

When connecting to a website using Net::HTTP you can parse the URL and output each of the request headers by using #each_header. I understand what the encoding, the user agent, and such mean, but not what the "accept"=>["*/*"] part is. Is this the accepted payload? Or is it something else?
require 'net/http'
uri = URI('http://www.bible-history.com/subcat.php?id=2')
# => #<URI::HTTP http://www.bible-history.com/subcat.php?id=2>
http_request = Net::HTTP::Get.new(uri)
http_request.each_header { |header| puts header }
# => {"accept-encoding"=>["gzip;q=1.0,deflate;q=0.6,identity;q=0.3"], "accept"=>["*/*"], "user-agent"=>["Ruby"], "host"=>["www.bible-history.com"]}
From https://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#z3
This field contains a semicolon-separated list of representation schemes ( Content-Type metainformation values) which will be accepted in the response to this request.
Basically, it specifies what kinds of content you can read back. If you write an API client, you may only be interested in application/json, for example (and you couldn't care less about text/html).
In this case, your header would look like this:
Accept: application/json
And the app will know not to send any html your way.
Using the Accept header, the client can specify the MIME types it is willing to accept for the requested URL. If the requested resource is available in multiple representations (e.g. an image as PNG, JPG or SVG), the user agent can specify that it wants the PNG version only. It is up to the server to honor this request.
In your example, the request header specifies that you are willing to accept any content type.
The header was defined in RFC 2616, which has since been superseded by RFC 7231 and, more recently, RFC 9110.
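To make this concrete with Net::HTTP, here is a minimal sketch of overriding the default */* value. The URL is a placeholder assumed for illustration, and no request is actually sent.

```ruby
require 'net/http'
require 'uri'

# Placeholder URL, assumed only for this example.
uri = URI('http://example.com/api/items')

request = Net::HTTP::Get.new(uri)
# Replace the default "*/*" with the single type this client can handle.
request['Accept'] = 'application/json'

request.each_header { |name, value| puts "#{name}: #{value}" }
```

Whether the server honors the header is still up to the server, so a client that only understands JSON should also check the Content-Type of the response.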

Fetching only X/HTML links (not images) based on mime type

I'm crawling a site using Ruby + OpenURI + Nokogiri. Fetch a page, find all the a[href] and (if they're in the same domain and right protocol) follow them to crawl again.
Sometimes there are links to large binaries (e.g. jpeg, exe), and I don't want to crawl those.
I tried using the HTTP "Accept" header to get an error or empty response for the wrong mime types like so:
require 'open-uri'
page = URI.open(url, 'Accept' => 'text/html,application/xhtml+xml,application/xml')
...but OpenURI still downloads binaries sent with another mime type.
Other than looking at file extensions in the url for a probable file type, how can I prevent the download (or detect a conflicting response type) for an arbitrary URL?
You could send a HEAD request first, then check the Content-Type header of the response and only make the real request if it’s acceptable:
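require 'net/http'

ACCEPTABLE_TYPES = %w[text/html application/xhtml+xml application/xml].freeze
uri = URI(url)
# HEAD returns only the headers, so the body is never downloaded.
type = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  # request_uri keeps the query string and is never empty, unlike path.
  http.head(uri.request_uri).content_type
end
if ACCEPTABLE_TYPES.include?(type)
  # fetch the url
else
  # do whatever
end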
This will need an extra request for each page, but I can’t see a way of avoiding it. It also relies on the server sending the same headers for a HEAD request as it does for a GET, which I think is a reasonable assumption but something to be aware of.

How do I process multipart HTTP responses in Ruby Net::HTTP?

There is so much information out there on how to generate multipart responses or do multipart file uploads. I can't seem to find any information on how to process a multipart http response. Here is some IRB output from a multipart http response I am working with.
>> response.http.content_type
=> "multipart/related"
>> response.http.body[0..2048]
=> "\r\n------=_Part_3_806633756.1271797659309\r\nContent-Type: text/xml; charset=UTF-8\r\nContent-Transfer-Encoding: binary\r\nContent-Id: <A0FCC4333C6D0FCA346B97FAB6B61818>\r\n\r\n<?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soapenv="http://www.w3.org/2003/05/soap-envelope" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><soapenv:Body><ns1:runReportResponse soapenv:encodingStyle="http://www.w3.org/2003/05/soap-encoding" xmlns:ns1="http://192.168.1.200:8080/jasperserver/services/repository"><ns2:result xmlns:ns2="http://www.w3.org/2003/05/soap-rpc">runReportReturn</ns2:result><runReportReturn xsi:type="xsd:string"><?xml version="1.0" encoding="UTF-8"?>\n<operationResult version="2.0.1">\n\t<returnCode><![CDATA[0]]></returnCode>\n</operationResult>\n</runReportReturn></ns1:runReportResponse></soapenv:Body></soapenv:Envelope>\r\n------=_Part_3_806633756.1271797659309\r\nContent-Type: application/pdf\r\nContent-Transfer-Encoding: binary\r\nContent-Id: <report>\r\n\r\n%PDF-1.4\n%\342\343\317\323\n3 0 obj
You can use Rack to do that for you. The utility function that does it is Rack::Utils::parse_multipart. You'll have to make your response object look like a request object Rack would accept (the env object).

Post request with body_stream and parameters

I'm building some kind of proxy.
When I call some URL in a Rack application, I forward that request to another URL.
The request I forward is a POST with a file and some parameters.
I want to add more parameters.
But the file can be quite big. So I send it with Net::HTTP#body_stream instead of Net::HTTP#body.
I get my request as a Rack::Request object and I create my Net::HTTP object with that.
req = Net::HTTP::Post.new(request.path_info)
req.body_stream = request.body
req.content_type = request.content_type
req.content_length = request.content_length
http = Net::HTTP.new(@host, @port)
res = http.request(req)
I've tried several ways to add the proxy's parameters. But it seems nothing in Net::HTTP allows adding parameters to a body_stream request, only to a body one.
Is there a simpler way to proxy a Rack request like that? Or a clean way to add my parameters to my request?
Well, as I see it, this is normal behaviour. I'll explain why. If you only have access to a Rack::Request, then (I guess) your middleware stack does not parse the request (you do not include something like ActionController::ParamsParser), so you don't have access to a hash of parameters, only to a StringIO. This StringIO corresponds to a stream like:
Content-Type: multipart/form-data; boundary=AaB03x
--AaB03x
Content-Disposition: form-data; name="param1"
value1
--AaB03x
Content-Disposition: form-data; name="files"; filename="file1.txt"
Content-Type: text/plain
... contents of file1.txt ...
--AaB03x--
What you are trying to do with the Net::HTTP class is to: (1) parse the request into a hash of parameters; (2) merge the parameters hash with your own parameters; (3) recreate the request. The problem is that the Net::HTTP library can't do (1), since it is a client library, not a server one.
Therefore, you cannot avoid parsing your request somehow before adding the new parameters.
Possible solutions:
Insert ActionController::ParamsParser before your middleware. After that, you may use the excellent rest-client lib to do something like:
RestClient.post('http://your_server' + request.path_info, params.merge(your_params))
You can attempt to make a wrapper around the StringIO object and add your own parameters at the end of the stream. However, this is neither trivial nor advisable.
Might be one year too late, but I had the same issue verifying Paypal IPNs. I wanted to forward back the IPN request to Paypal for verification but needed to add :cmd => '_notify-validate'.
Instead of modifying the body stream, or body, I appended it as part of the URL path, like so:
reply_request = Net::HTTP::Post.new(url.path + '?cmd=_notify-validate')
It seems a bit of a hack, but I think it's worth it if you aren't going to use it for anything else.
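A slightly more general version of the same trick might look like the sketch below. `path_with_params` is a hypothetical helper, and '/cgi-bin/webscr' is an assumed path used only for illustration; URI.encode_www_form takes care of escaping the values.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: appends extra parameters to a request path as a
# query string, leaving the (streamed) body untouched.
def path_with_params(path, extra_params)
  separator = path.include?('?') ? '&' : '?'
  path + separator + URI.encode_www_form(extra_params)
end

# Assumed path, for illustration only.
reply_request = Net::HTTP::Post.new(
  path_with_params('/cgi-bin/webscr', { 'cmd' => '_notify-validate' })
)
puts reply_request.path
```

This only helps when the receiving server also reads parameters from the query string of a POST, which is worth confirming before relying on it.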
