Sending a .gz file via curl to a RESTful PUT creates a ZipException in GZIPInputStream - spring-boot

The application I am creating takes a gzipped file sent to a RESTful PUT, unzips the file and then does further processing like so:
public class Service {

    @PUT
    @Path("/{filename}")
    public Response doPut(@Context HttpServletRequest request,
                          @PathParam("filename") String filename,
                          InputStream inputStream) {
        try {
            GZIPInputStream gzipInputStream = new GZIPInputStream(inputStream);
            // Do stuff with the GZIPInputStream
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
I am able to successfully send a gzipped file in a unit test like so:
InputStream inputStream = new FileInputStream("src/main/resources/testFile.gz");
Service service = new Service();
service.doPut(mockHttpServletRequest, "testFile.gz", inputStream);
// Verify processing stuff happens
But when I build the application and attempt to curl the same file from the src/main/resources directory with the following command, I get a ZipException:
curl -v -k -X PUT --user USER:Password -H "Content-Type: application/gzip" --data-binary @testFile.gz https://myapp.dev.com/testFile.gz
The exception is:
java.util.zip.ZipException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at Service.doPut(Service.java:23)
// etc.
So does anyone have any idea why sending the file via CURL causes the ZipException?
Update:
I ended up taking a look at the actual bytes being sent via the InputStream and figured out where the ZipException: Not in GZIP format error was coming from. The first two bytes of a GZIP file are required to be 1F and 8B respectively in order for GZIPInputStream to recognize the data as being in GZIP format. Instead, the 8B byte, along with every other byte in the stream that doesn't correspond to a valid UTF-8 character, was transformed into the bytes EF, BF, BD, which are the UTF-8 replacement-character bytes. Thus the server is reading the GZIP data as UTF-8 text rather than as binary and is corrupting the data.
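For reference, this is roughly how the first bytes can be inspected without consuming them. It is a diagnostic sketch rather than the original code, assumed to sit inside doPut before the stream is wrapped (it needs java.io.PushbackInputStream):
PushbackInputStream peekable = new PushbackInputStream(inputStream, 2);
byte[] magic = new byte[2];
int read = peekable.read(magic);
if (read > 0) {
    peekable.unread(magic, 0, read); // push the bytes back so GZIPInputStream still sees them
}
// A real gzip body starts with 1F 8B; EF BF (the start of EF BF BD) means the body was decoded as text upstream.
System.out.printf("first two bytes: %02X %02X%n", magic[0] & 0xFF, magic[1] & 0xFF);
GZIPInputStream gzipInputStream = new GZIPInputStream(peekable);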
The issue I am having now is that I can't figure out where I need to change the configuration to get the server to treat the compressed data as binary rather than UTF-8. The application uses JAX-RS on a Jersey server with Spring Boot, deployed in a Kubernetes pod and run as a service, so something in the setup of one of those technologies needs to be tweaked to prevent the wrong encoding from being applied to the data.
I have tried adding -H "Content-Encoding: gzip" to the curl command, registering the EncodingFilter.class and GZipEncoder.class in the Jersey ResourceConfig class, adding application/gzip to server.compression.mime-types in application.properties, adding the @Consumes("application/gzip") annotation to the doPut method above, and several other things I can't remember off the top of my head, but nothing seems to have any effect.
I am seeing the following in the verbose CURL logs:
> PUT /src/main/resources/testFile.gz
> HOST: my.host.com
> Authorization: Basic <authorization stuff>
> User-Agent: curl/7.54.1
> Accept: */*
> Content-Encoding: gzip
> Content-Type: application/gzip
> Content-Length: 31
>
} [31 bytes data]
* upload completely sent off: 31 out of 31 bytes
< HTTP/1.1 500
< X-Application-Context: application
< Content-Type: application/json;charset=UTF-8
< Transfer-Encoding: chunked
< Date: <date stuff>
...etc
Nothing I have done has affected the response's
Content-Type: application/json;charset=UTF-8
header, which I suspect is the issue.

I ran into the same problem and finally solved it by using -H 'Content-Type: application/json;charset=UTF-8' in the curl command.
Use Charles to find the difference
I could successfully send the gzipped file using Postman, so I used Charles to capture the two requests sent by curl and Postman respectively. After comparing them, I found that Postman used application/json as the Content-Type while curl used text/plain.
Spring docs: Content Type and Transformation
According to the Spring docs, if the content type is text/plain and the source payload is a byte[], Spring will convert the payload to a String using the charset specified in the content-type header. That's why the ZipException occurred: the original byte data had already been decoded and was no longer in gzip format.
Spring source code
@Override
protected Object convertFromInternal(Message<?> message, Class<?> targetClass, @Nullable Object conversionHint) {
    Charset charset = getContentTypeCharset(getMimeType(message.getHeaders()));
    Object payload = message.getPayload();
    return (payload instanceof String ? payload : new String((byte[]) payload, charset));
}

Related

Multipart request rejected because no boundary was found while the boundary is already sent

I went through the other similar questions but couldn't find a solution. We have a backend service built with Spring Boot that has been working for a while now. Recently a new user of this service started sending requests with MuleSoft, but all their attempts to send a file to our service fail with this error:
Failed to parse multipart servlet request; nested exception is java.io.IOException: org.apache.tomcat.util.http.fileupload.FileUploadException: the request was rejected because no multipart boundary was found
The only difference we could find between a request from MuleSoft and, say, a curl command is that MuleSoft always sends the request with the boundary value wrapped in double quotes.
Mule request:
<Header name="Content-Type">multipart/form-data; charset=UTF-8; boundary="--------------------------669816398596264398718285"</Header>
Versus Postman/curl request:
* SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fd656810a00)
> POST /api/upload HTTP/2
> Host: myhost
> user-agent: curl/7.79.1
> accept: */*
> content-length: 97255
> content-type: multipart/form-data; boundary=------------------------111fd08cb4fafe1c
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
* We are completely uploaded and fine
< HTTP/2 200
< date: Mon, 19 Dec 2022 04:56:25 GMT
< content-length: 0
Our controller in Spring is very simple:
@RestController
class MyController {
    @PostMapping("/upload", consumes = [MediaType.MULTIPART_FORM_DATA_VALUE])
    @ResponseStatus(HttpStatus.OK)
    fun uploadDocument(@RequestPart("file") file: MultipartFile) {
        logger.info { "ContentType: ${file.contentType}" }
        logger.info { "Name: ${file.name}" }
        logger.info { "Byte: ${String(file.bytes)}" }
    }
}
The following curl command works fine:
curl -v -X POST -F file=@/Users/myhomefolder/Documents/some-file.jpg https://host-name/api/upload
But this script from MuleSoft doesn't (Sorry I'm not familiar with Mule, I got this code from their team):
import dw::module::Multipart
output multipart/form-data boundary = "---WebKitFormBoundary7MA4YWxkTrZu0gW"
---
{
    parts: {
        file: {
            headers: {
                "Content-Disposition": {
                    "name": "file",
                    "filename": payload.name
                },
                "Content-Type": "multipart/form-data"
            },
            content: payload.byteArray
        }
    }
}
Is there any configuration in Spring that accepts double quotes around the boundary? Is there anything missing in our backend configuration that should be added to support different HTTP clients?
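For what it's worth, the quoted form is legal per RFC 2046, so one way to narrow this down is to reproduce it outside MuleSoft. A minimal sketch (not from the original post) that sends a hand-built multipart body with the boundary quoted in the Content-Type header, using the placeholder host from the question:
public class QuotedBoundaryRepro {
    public static void main(String[] args) throws Exception {
        String boundary = "----testBoundary123";
        byte[] fileBytes = "hello".getBytes(java.nio.charset.StandardCharsets.UTF_8);

        // Build the multipart body by hand so it matches the boundary in the header exactly.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"test.txt\"\r\n"
                + "Content-Type: text/plain\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        java.io.ByteArrayOutputStream body = new java.io.ByteArrayOutputStream();
        body.write(head.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        body.write(fileBytes);
        body.write(tail.getBytes(java.nio.charset.StandardCharsets.UTF_8));

        java.net.http.HttpRequest request = java.net.http.HttpRequest.newBuilder()
                .uri(java.net.URI.create("https://host-name/api/upload"))
                // Boundary wrapped in double quotes, as MuleSoft sends it.
                .header("Content-Type", "multipart/form-data; boundary=\"" + boundary + "\"")
                .POST(java.net.http.HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                .build();

        java.net.http.HttpResponse<String> response = java.net.http.HttpClient.newHttpClient()
                .send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}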

HTTP compression with Spring Boot and Nginx

I have a setup with an Nginx reverse proxy, a Spring Boot application, and a (Redis) cache, and would like to ask (1) how to configure Nginx to only compress the data if it is not compressed yet, and (2) how to send compressed and cached data in Spring Boot correctly.
Current setup
Nginx acts as a reverse proxy and compresses responses with the specified content types:
gzip on;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
The Spring Boot API application has a few compute-intensive endpoints that manage their own cache and many lightweight endpoints that don't need caching. The response sizes are often quite large (up to several hundred MB), so the data is compressed before caching. In pseudo-code:
#RequestMapping("/uncached1")
public MyUncached1Response getUncachedData1(String query) {
return dataservice.getResults1(query);
}
#RequestMapping("/cached1")
public String getCachedData1(String query) {
if (cache.has(query)) {
return uncompress(cache.get(query));
} else {
String results = dataservice.getResults2(query);
cache.set(query, compress(results));
return results;
}
}
As you can see, the setup is compressing and uncompressing a lot. If the value has not been cached, the application compresses the results for the cache. Then, the application returns the uncompressed data and Nginx compresses it again. If the value is already in the cache, the application first uncompresses it and gives Nginx the uncompressed data for compression.
Envisioned setup
I am wondering if the following setup would be possible:
Nginx only compresses the data if it has not been compressed yet
Some endpoints of the Spring Boot application return compressed data if the client accepts it:
#RequestMapping("/uncached1")
public MyUncached1Response getUncachedData1(String query) {
return dataservice.getResults1(query);
}
#RequestMapping("/cached1")
public byte[] getCachedData1(String query) {
if (cache.has(query)) {
byte[] compressed = cache.get(query);
if (client.acceptsGzip()) {
return compressed;
} else {
return uncompressed(compressed).toByteArray();
}
} else {
String results = dataservice.getResults2(query);
byte[] compressed = compress(results);
cache.set(query, compressed);
if (client.acceptsGzip()) {
return compressed;
} else {
return results.toByteArray();
}
}
}
Questions:
Does the envisioned setup make sense, is it possible? If yes, could you please provide me with some hints for the implementation?
If it doesn't work this way, what would be a better architecture?
Currently, I use Java's Deflater for compression, but that's not entirely the same as gzip, right? How can I compress the data in a way that is compatible with gzip for HTTP?
How can I see whether the client accepts gzip?
Thanks so much!
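Regarding the last two questions: java.util.zip.GZIPOutputStream produces gzip-framed output (a raw Deflater does not, unless you add the gzip header and trailer yourself), and the client's support can be read from the Accept-Encoding request header. As for double compression, Nginx's gzip module normally skips responses that already carry a Content-Encoding header. A sketch of the cached endpoint along those lines, keeping cache and dataservice as the placeholders from the question and assuming a JSON payload; this is an illustration, not a drop-in implementation:
@RequestMapping("/cached1")
public ResponseEntity<byte[]> getCachedData1(String query,
        @RequestHeader(value = "Accept-Encoding", defaultValue = "") String acceptEncoding) throws IOException {
    byte[] compressed;
    if (cache.has(query)) {
        compressed = cache.get(query);
    } else {
        String results = dataservice.getResults2(query);
        // GZIPOutputStream writes the gzip header and trailer, so the bytes are valid for Content-Encoding: gzip.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(results.getBytes(StandardCharsets.UTF_8));
        }
        compressed = bos.toByteArray();
        cache.set(query, compressed);
    }

    HttpHeaders responseHeaders = new HttpHeaders();
    responseHeaders.setContentType(MediaType.APPLICATION_JSON);

    // Crude check; a full implementation would parse quality values in Accept-Encoding.
    if (acceptEncoding.contains("gzip")) {
        // Declare the body as already gzip-encoded so the proxy does not compress it again.
        responseHeaders.set(HttpHeaders.CONTENT_ENCODING, "gzip");
        return new ResponseEntity<>(compressed, responseHeaders, HttpStatus.OK);
    }

    // Client does not accept gzip: decompress before returning.
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
        return new ResponseEntity<>(StreamUtils.copyToByteArray(in), responseHeaders, HttpStatus.OK);
    }
}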

Azure Form Recognizer training not finding data

I'm trying to train a Form Recognizer using the browser API console (https://eastus.dev.cognitive.microsoft.com/docs/services/form-recognizer-api/operations/TrainCustomModel/console). I've uploaded training images to a container and created a SAS. The browser API console generates the following HTTP request:
POST https://eastus.api.cognitive.microsoft.com/formrecognizer/v1.0-preview/custom/train?source=https://pythonimages.blob.core.windows.net/?sv=2019-02-02&ss=bfqt&srt=sco&sp=rl&se=2020-01-22T00:23:33Z&st=2020-01-21T16:23:33Z&spr=https&sig=••••••••••••••••••••••••••••••••&prefix=images HTTP/1.1
Host: eastus.api.cognitive.microsoft.com
Content-Type: application/json
Ocp-Apim-Subscription-Key: ••••••••••••••••••••••••••••••••
{
"source": "string",
"sourceFilter": {
"prefix": "string",
"includeSubFolders": true
}
}
However, the response I get back is:
Transfer-Encoding: chunked
x-envoy-upstream-service-time: 4
apim-request-id: 5ad37aa2-e251-4b61-98ae-023930b47d27
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
x-content-type-options: nosniff
Date: Tue, 21 Jan 2020 16:25:03 GMT
Content-Type: application/json; charset=utf-8
{
"error": {
"code": "1004",
"message": "Dataset path must be relative to local input mount path '/input' if local data is referenced."
}
}
I don't understand why it seems to be looking for data locally. I've experimented with the SAS, e.g. including the container name (images) in the blob http address rather than as a query parameter, but no success so far.
I've also tried the Python/REST path (described here: https://learn.microsoft.com/en-gb/azure/cognitive-services/form-recognizer/quickstarts/python-train-extract-v1), which results in a different error:
Response status code: 408
Response body: {'error': {'code': '1011', 'innerError': {'requestId': 'e7f9ef9f-97bc-4b6a-86f3-0b29c9591c87'}, 'message': 'The operation exceeded allowed time limit and was canceled. The common reasons are that the data source is too large or contains unsupported content. Please check that your request conforms to service limits and retry with redacted data source.'}}
For completeness, the code I use is as follows (key/signature starred out):
########### Python Form Recognizer Train #############
from requests import post as http_post

# Endpoint URL
base_url = r"https://markusformsrecognizer.cognitiveservices.azure.com/" + "/formrecognizer/v1.0-preview/custom"
source = r"https://pythonimages.blob.core.windows.net/images?sv=2019-02-02&ss=bfqt&srt=sco&sp=rl&se=2020-01-22T15:37:26Z&st=2020-01-22T07:37:26Z&spr=https&sig=*********************************"

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': '*********************************'
}

url = base_url + "/train"
body = {"source": source}

try:
    resp = http_post(url=url, json=body, headers=headers)
    print("Response status code: %d" % resp.status_code)
    print("Response body: %s" % resp.json())
except Exception as e:
    print(str(e))
For error code 1004, follow the steps below to get the source path containing the training documents and pass it as the value of the source key:
{
"source": "string",
"sourceFilter": {
"prefix": "string",
"includeSubFolders": true
}
}
Replace the source value with the Azure Blob storage container's shared access signature (SAS) URL. To retrieve the SAS URL, open the Microsoft Azure Storage Explorer, right-click your container, and select Get shared access signature.
Make sure the Read and List permissions are checked, and click Create.
Then copy the value in the URL section. It should have the form:
https://<storage account>.blob.core.windows.net/<container name>?<SAS value>
Please use the new Form Recognizer v2.0 release; it is an async API and enables training on large data sets and analyzing large documents: https://aka.ms/form-recognizer/api
quick start - https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/python-train-extract
To get started with Form Recognizer, please log in to the Azure Portal and create a Form Recognizer resource (for v2.0 (preview), please use the West US 2 or West Europe regions).
Try removing the string value from the prefix property:
{
"source": "string",
"sourceFilter": {
"prefix": "",
"includeSubFolders": true
}
}
The Python Quick Start code for version 2.0 seems to be working; at least I don't get any errors anymore. I'm now feeling slightly silly that I didn't try this earlier. The API (web-browser) console, linked from the Quick Start page of the Form Recognizer, seems to automatically assume I want to use version 1.0, and there's no way to change that (or perhaps I've just overlooked something). Hence I assumed I'd been allocated a v1.0 trial and therefore that's what I used when I tried the Python Quick Start the first time around.
Instead of using just the SAS URI in the "source" request parameter of the API POST call, use the complete string of the container followed by the SAS token.
For example:
https://<storage account>.blob.core.windows.net/<container name>/<SAS token>

Spring Cloud Stream w/Kafka + Confluent Schema Registry Client broken?

Curious if anyone has got this working as I'm currently struggling.
I have created simple Source and Sink applications to send and receive an Avro schema based message. The schema for the message is held in a Confluent Schema Registry. Both apps are configured to use the ConfluentSchemaRegistryClient class but I think there might be a bug in here somewhere. Here's what I see that makes me wonder.
If I interact with the Confluent registry's REST API I can see that there is only one version of the schema in question (lightly edited to obscure what I'm working on):
$ curl -i "http://schemaregistry:8081/subjects/somesubject/versions"
HTTP/1.1 200 OK
Date: Fri, 05 May 2017 16:13:37 GMT
Content-Type: application/vnd.schemaregistry.v1+json
Content-Length: 3
Server: Jetty(9.2.12.v20150709)
[1]
When the Source app sends off its message over Kafka I noticed that the version in the header looked a bit funky:
contentType"application/octet-stream"originalContentType/"application/vnd.somesubject.v845+avro"
I'm not 100% clear about why the application/vnd.somesubject.v845+avro content type is wrapped up in application/octet-stream but ignoring that, note that it is saying version 845 not version 1.
Looking at the ConfluentSchemaRegistryClient implementation I see that it POSTs to /subjects/(string: subject)/versions and returns the id of the schema, not the version. This then gets put into the SchemaReference's version field: https://github.com/spring-cloud/spring-cloud-stream/blob/master/spring-cloud-stream-schema/src/main/java/org/springframework/cloud/stream/schema/client/ConfluentSchemaRegistryClient.java#L81
When the Sink app tries to fetch the schema for the message based upon the header, it fails because it tries to fetch version 845, which it's plucked out of the header: https://github.com/spring-cloud/spring-cloud-stream/blob/master/spring-cloud-stream-schema/src/main/java/org/springframework/cloud/stream/schema/client/ConfluentSchemaRegistryClient.java#L87
Anyone have thoughts on this? Thanks in advance.
** UPDATE **
OK, I'm pretty convinced this is a bug. I took the ConfluentSchemaRegistryClient and modified the register method slightly to POST to /subjects/(string: subject) (i.e. dropped the trailing /versions), which per the Confluent REST API docs returns a payload with the version in it. Works like a charm:
public SchemaRegistrationResponse register(String subject, String format, String schema) {
    Assert.isTrue("avro".equals(format), "Only Avro is supported");
    String path = String.format("/subjects/%s", subject);
    HttpHeaders headers = new HttpHeaders();
    headers.put("Accept",
            Arrays.asList("application/vnd.schemaregistry.v1+json", "application/vnd.schemaregistry+json",
                    "application/json"));
    headers.add("Content-Type", "application/json");
    Integer version = null;
    try {
        String payload = this.mapper.writeValueAsString(Collections.singletonMap("schema", schema));
        HttpEntity<String> request = new HttpEntity<>(payload, headers);
        ResponseEntity<Map> response = this.template.exchange(this.endpoint + path, HttpMethod.POST, request,
                Map.class);
        version = (Integer) response.getBody().get("version");
    }
    catch (JsonProcessingException e) {
        e.printStackTrace();
    }
    SchemaRegistrationResponse schemaRegistrationResponse = new SchemaRegistrationResponse();
    schemaRegistrationResponse.setId(version);
    schemaRegistrationResponse.setSchemaReference(new SchemaReference(subject, version, "avro"));
    return schemaRegistrationResponse;
}
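One way to wire such a patched client in (a sketch, not from the original post) is to keep a copy of the class in your own project, say PatchedConfluentSchemaRegistryClient, containing the register() above and the original's setEndpoint, and expose it as a bean so it is used instead of the stock client:
import org.springframework.cloud.stream.schema.client.SchemaRegistryClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SchemaRegistryClientConfig {

    @Bean
    public SchemaRegistryClient schemaRegistryClient() {
        // PatchedConfluentSchemaRegistryClient is a hypothetical local copy of
        // ConfluentSchemaRegistryClient with the register() method shown above.
        PatchedConfluentSchemaRegistryClient client = new PatchedConfluentSchemaRegistryClient();
        client.setEndpoint("http://schemaregistry:8081"); // registry host from the curl example above
        return client;
    }
}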

How can I Read and Transfer chunks of file with Hadoop WebHDFS?

I need to transfer big files (at least 14MB) from the Cosmos instance of the FIWARE Lab to my backend.
I used the Spring RestTemplate as a client interface for the Hadoop WebHDFS REST API described here, but I run into an IOException:
Exception in thread "main" org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://cosmos.lab.fiware.org:14000/webhdfs/v1/user/<user.name>/<path>?op=open&user.name=<user.name>":Truncated chunk ( expected size: 14744230; actual size: 11285103); nested exception is org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 14744230; actual size: 11285103)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:580)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:545)
at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:466)
This is the actual code that generates the Exception:
RestTemplate restTemplate = new RestTemplate();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory());
restTemplate.getMessageConverters().add(new ByteArrayHttpMessageConverter());

HttpEntity<?> entity = new HttpEntity<>(headers);

UriComponentsBuilder builder = UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name);

ResponseEntity<byte[]> response = restTemplate
        .exchange(builder.build().encode().toUri(), HttpMethod.GET, entity, byte[].class);

FileOutputStream output = new FileOutputStream(new File(local_path));
IOUtils.write(response.getBody(), output);
output.close();
I think this is due to a transfer timeout on the Cosmos instance, so I tried sending a curl request to the path specifying offset, buffer and length parameters, but they seem to be ignored: I got the whole file.
Thanks in advance.
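As a side note on the snippet above: exchange(..., byte[].class) buffers the whole response in memory before anything is written to disk. A streaming variant (a sketch, not the original code, reusing the hdfs_path, user_name and local_path placeholders) hands the body straight to the file via a ResponseExtractor:
RestTemplate streamingTemplate = new RestTemplate(new HttpComponentsClientHttpRequestFactory());

URI uri = UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name)
        .build().encode().toUri();

streamingTemplate.execute(uri, HttpMethod.GET, null, clientHttpResponse -> {
    // Copy the response body to the local file as it arrives instead of buffering it all.
    try (OutputStream out = new FileOutputStream(local_path)) {
        StreamUtils.copy(clientHttpResponse.getBody(), out);
    }
    return null;
});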
Ok, I found out a solution. I don't understand why, but the transfer succeeds if I use a Jetty HttpClient instead of the RestTemplate (and thus Apache HttpClient). This works now:
ContentExchange exchange = new ContentExchange(true) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();

    protected void onResponseContent(Buffer content) throws IOException {
        bos.write(content.asArray(), 0, content.length());
    }

    protected void onResponseComplete() throws IOException {
        if (getResponseStatus() == HttpStatus.OK_200) {
            FileOutputStream output = new FileOutputStream(new File(<local_path>));
            IOUtils.write(bos.toByteArray(), output);
            output.close();
        }
    }
};

UriComponentsBuilder builder = UriComponentsBuilder.fromHttpUrl(<hdfs_path>)
        .queryParam("op", "OPEN")
        .queryParam("user.name", <user_name>);

exchange.setURL(builder.build().encode().toUriString());
exchange.setMethod("GET");
exchange.setRequestHeader("X-Auth-Token", <token>);

HttpClient client = new HttpClient();
client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
client.setMaxConnectionsPerAddress(200);
client.setThreadPool(new QueuedThreadPool(250));
client.start();

client.send(exchange);
exchange.waitForDone();
Is there any known bug on the Apache Http Client for chunked files transfer?
Was I doing something wrong in my RestTemplate request?
UPDATE: I still don't have a solution
After a few tests I see that I have not solved my problem.
I found out that the Hadoop version installed on the Cosmos instance is quite old (Hadoop 0.20.2-cdh3u6), and I read that its WebHDFS doesn't support partial file transfer with the length parameter (introduced in v0.23.3).
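For reference, on a WebHDFS version new enough to honour them, offset and length are query parameters on the OPEN operation, so the earlier RestTemplate call could pull the file in chunks. A sketch reusing the restTemplate, entity, hdfs_path and user_name placeholders from above (it would not help against this old Cosmos instance):
long offset = 0;
long length = 4 * 1024 * 1024; // ask for 4 MB per request

URI chunkUri = UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name)
        .queryParam("offset", offset)
        .queryParam("length", length)
        .build().encode().toUri();

ResponseEntity<byte[]> chunk =
        restTemplate.exchange(chunkUri, HttpMethod.GET, entity, byte[].class);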
These are the headers I received from the server when I sent a GET request using curl:
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: HEAD, POST, GET, OPTIONS, DELETE
Access-Control-Allow-Headers: origin, content-type, X-Auth-Token, Tenant-ID, Authorization
server: Apache-Coyote/1.1
set-cookie: hadoop.auth="u=<user>&p=<user>&t=simple&e=1448999699735&s=rhxMPyR1teP/bIJLfjOLWvW2pIQ="; Version=1; Path=/
Content-Type: application/octet-stream; charset=utf-8
content-length: 172934567
date: Tue, 01 Dec 2015 09:54:59 GMT
connection: close
As you can see, the Connection header is set to close. In fact, the connection is usually closed whenever the GET request lasts more than 120 seconds, even if the file transfer has not completed.
In conclusion, I can say that Cosmos is totally useless if it doesn't support large file transfer.
Please correct me if I'm wrong, or if you know a workaround.
