How can I read and transfer chunks of a file with Hadoop WebHDFS? - hadoop

I need to transfer big files (at least 14MB) from the Cosmos instance of the FIWARE Lab to my backend.
I used the Spring RestTemplate as a client interface for the Hadoop WebHDFS REST API described here, but I ran into an I/O exception:
Exception in thread "main" org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://cosmos.lab.fiware.org:14000/webhdfs/v1/user/<user.name>/<path>?op=open&user.name=<user.name>":Truncated chunk ( expected size: 14744230; actual size: 11285103); nested exception is org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 14744230; actual size: 11285103)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:580)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:545)
at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:466)
This is the actual code that generates the Exception:
RestTemplate restTemplate = new RestTemplate();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory());
restTemplate.getMessageConverters().add(new ByteArrayHttpMessageConverter());

HttpEntity<?> entity = new HttpEntity<>(headers);
UriComponentsBuilder builder = UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name);

ResponseEntity<byte[]> response = restTemplate
        .exchange(builder.build().encode().toUri(), HttpMethod.GET, entity, byte[].class);

FileOutputStream output = new FileOutputStream(new File(local_path));
IOUtils.write(response.getBody(), output);
output.close();
I think this is due to a transfer timeout on the Cosmos instance, so I tried to send a curl request to the path specifying the offset, buffer and length parameters, but they seem to be ignored: I got the whole file.
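For reference, this is roughly the shape of the chunked request when expressed with the same RestTemplate setup as above (a sketch only; offset, length and buffersize are the parameters documented for the WebHDFS OPEN operation, and as described above they appear to be ignored by this server):

// Sketch: ask WebHDFS for a 1 MB slice of the file starting at byte 0.
UriComponentsBuilder chunkBuilder = UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name)
        .queryParam("offset", 0L)             // first byte of the chunk
        .queryParam("length", 1024 * 1024L)   // number of bytes to return
        .queryParam("buffersize", 4096);      // server-side read buffer size

ResponseEntity<byte[]> chunk = restTemplate.exchange(
        chunkBuilder.build().encode().toUri(), HttpMethod.GET, entity, byte[].class);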
Thanks in advance.

OK, I found a solution. I don't understand why, but the transfer succeeds if I use a Jetty HttpClient instead of the RestTemplate (which is backed by the Apache HttpClient here). This works now:
ContentExchange exchange = new ContentExchange(true) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();

    protected void onResponseContent(Buffer content) throws IOException {
        bos.write(content.asArray(), 0, content.length());
    }

    protected void onResponseComplete() throws IOException {
        if (getResponseStatus() == HttpStatus.OK_200) {
            FileOutputStream output = new FileOutputStream(new File(<local_path>));
            IOUtils.write(bos.toByteArray(), output);
            output.close();
        }
    }
};

UriComponentsBuilder builder = UriComponentsBuilder.fromHttpUrl(<hdfs_path>)
        .queryParam("op", "OPEN")
        .queryParam("user.name", <user_name>);

exchange.setURL(builder.build().encode().toUriString());
exchange.setMethod("GET");
exchange.setRequestHeader("X-Auth-Token", <token>);

HttpClient client = new HttpClient();
client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
client.setMaxConnectionsPerAddress(200);
client.setThreadPool(new QueuedThreadPool(250));
client.start();
client.send(exchange);
exchange.waitForDone();
Is there any known bug on the Apache Http Client for chunked files transfer?
Was I doing something wrong in my RestTemplate request?
UPDATE: I still don't have a solution
After a few tests I realized that I haven't actually solved my problem.
I found out that the Hadoop version installed on the Cosmos instance is quite old (Hadoop 0.20.2-cdh3u6), and I read that this version of WebHDFS doesn't support partial file transfer with the length parameter (introduced in v0.23.3).
These are the headers I received from the Server when I send a GET request using curl:
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: HEAD, POST, GET, OPTIONS, DELETE
Access-Control-Allow-Headers: origin, content-type, X-Auth-Token, Tenant-ID, Authorization
server: Apache-Coyote/1.1
set-cookie: hadoop.auth="u=<user>&p=<user>&t=simple&e=1448999699735&s=rhxMPyR1teP/bIJLfjOLWvW2pIQ="; Version=1; Path=/
Content-Type: application/octet-stream; charset=utf-8
content-length: 172934567
date: Tue, 01 Dec 2015 09:54:59 GMT
connection: close
As you can see, the connection header is set to close. Actually, the connection is usually closed each time the GET request lasts more than 120 seconds, even if the file transfer has not been completed.
In conclusion, I can say that Cosmos is totally useless if it doesn't support large file transfer.
Please correct me if I'm wrong, or if you know a workaround.

Related

Apache Http Components - How to timeout CONNECT request to a proxy?

Timeout Without Using Proxy
I start netcat in my local as follows, which basically listens to connections on port 9090:
netcat -l -p 9090
And using Apache HttpComponents, I create a connection to it with a timeout of 4 seconds:
RequestConfig requestConfig = RequestConfig.custom()
        .setSocketTimeout(4000)
        .setConnectTimeout(4000)
        .setConnectionRequestTimeout(4000)
        .build();
HttpGet httpget = new HttpGet("http://127.0.0.1:9090");
httpget.setConfig(requestConfig);
try (CloseableHttpResponse response = HttpClients.createDefault().execute(httpget)) {}
In terminal (where I have netcat running) I see:
??]?D???;#???9?Mۡ?NR?w?{)?V?$?(=?&?*kj?
?5??98?#?'<?%?)g#? ?/??32?,?+?0??.?2???/??-?1???D
<!-- 4 seconds later -->
read(net): Connection reset by peer
In client side what I see is:
Exception in thread "main" org.apache.http.conn.ConnectTimeoutException:
Connect to 127.0.0.1:9090 [/127.0.0.1] failed: Read timed out
This is all expected.
Timeout Using Proxy
I change the client code slightly and configure a proxy, following the docs here.
RequestConfig requestConfig = RequestConfig.custom()
        .setSocketTimeout(4000)
        .setConnectTimeout(4000)
        .setConnectionRequestTimeout(4000)
        .build();

HttpHost proxy = new HttpHost("127.0.0.1", 9090);
DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
CloseableHttpClient httpclient = HttpClients.custom()
        .setRoutePlanner(routePlanner)
        .build();
HttpGet httpget = new HttpGet("https://127.0.0.1:9090");
httpget.setConfig(requestConfig);
try (CloseableHttpResponse response = httpclient.execute(httpget)) {}
And again I start netcat, and this time on the server side I see:
CONNECT 127.0.0.1:9090 HTTP/1.1
Host: 127.0.0.1:9090
User-Agent: Apache-HttpClient/4.4.1 (Java/1.8.0_212)
But the timeout does not work for the CONNECT request. I just wait forever.
How can I configure the httpclient to timeout for 4 seconds just like in the first case I described?
RequestConfig settings only take effect once a connection to the target via the specific route has been fully established. They do not apply to the SSL handshake or any CONNECT requests that take place prior to the main message exchange.
Configure the socket timeout at the ConnectionManager level to ensure that connection-level operations time out after a certain period of inactivity.
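A minimal sketch of that suggestion (assuming HttpClient 4.x; the proxy and route planner are reused from the question, and the 4-second value matches the example above):

// A socket timeout configured on the connection manager applies to all
// connection-level I/O, including the CONNECT exchange with the proxy.
PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
connectionManager.setDefaultSocketConfig(SocketConfig.custom()
        .setSoTimeout(4000)
        .build());

HttpHost proxy = new HttpHost("127.0.0.1", 9090);
CloseableHttpClient httpclient = HttpClients.custom()
        .setConnectionManager(connectionManager)
        .setRoutePlanner(new DefaultProxyRoutePlanner(proxy))
        .build();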
One possibility:
// This part is the same..
httpget.setConfig(requestConfig);

ExecutorService executorService = Executors.newSingleThreadExecutor();
Callable<CloseableHttpResponse> callable = () -> {
    try (CloseableHttpResponse response = httpclient.execute(httpget)) {
        return response;
    }
};

Future<CloseableHttpResponse> future = executorService.submit(callable);
try {
    future.get(4, TimeUnit.SECONDS);
} catch (InterruptedException | ExecutionException | TimeoutException e) {
    httpget.abort();
    executorService.shutdownNow();
}
But I am open to other suggestions..

Redirect HTTP requests to HTTPS in netty

I am modifying Elasticsearch code to configure HTTPS without X-Pack or reverse proxies.
I modified the initChannel() method in the Netty4HttpServerTransport file. HTTPS is working fine, but I want to redirect HTTP to HTTPS.
The code is:
char[] password = "your5663".toCharArray();

KeyStore ks = KeyStore.getInstance("JKS");
ks.load(new FileInputStream("C:/OpenSSL-Win64/bin/keystore1.jks"), password);

KeyManagerFactory kmf = KeyManagerFactory.getInstance("SunX509");
kmf.init(ks, password);

TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
tmf.init(ks);
TrustManager[] tm = tmf.getTrustManagers();

SSLContext sslContext = SSLContext.getInstance("TLSv1.3");
sslContext.init(kmf.getKeyManagers(), tm, null);

SSLEngine sslengine = sslContext.createSSLEngine();
sslengine.setUseClientMode(false);

String[] DEFAULT_PROTOCOLS = { "TLSv1", "TLSv1.1", "TLSv1.2", "TLSv1.3" };
String[] DEFAULT_CIPHERS = { "TLS_RSA_WITH_AES_128_CBC_SHA256", "TLS_RSA_WITH_AES_128_CBC_SHA" };
sslengine.setEnabledProtocols(DEFAULT_PROTOCOLS);
sslengine.setEnabledCipherSuites(DEFAULT_CIPHERS);

SslHandler sslHandler = new SslHandler(sslengine);
ch.pipeline().addLast("ssl", sslHandler);
ch.pipeline().addAfter("ssl", "handshake", new StringEventHandler());
How do I make an HTTP to HTTPS redirect in this code?
A redirect works at the payload (HTTP) level, not at the SSL transport level. You would need to listen for both protocols (HTTP and HTTPS), and on the HTTP channel you can respond with a redirect status code. Long story short: there is no direct place in your code where you can do that.
Very commonly a proxy server is used for this task. I am not sure if you can do it in Elasticsearch; you can try to configure a filter servlet that checks the protocol and responds with a redirect. This may be helpful: https://github.com/elastic/elasticsearch-transport-wares
Another fact: if the redirect is intended for service clients (not a browser-based UI), the clients may/will consider a redirect response an error response. Depending on your environment, maybe you can just expose the SSL endpoint (no redirects) and clients will have to comply.
Netty has a built-in handler for this, OptionalSslHandler.
You put it at the front of your pipeline and it detects whether the incoming bytes are encrypted or not. If they are, the message is passed on to the normal SSL pipeline; if not, you can route it somewhere else, e.g. to a handler that answers with a 301 redirect to HTTPS.
You could either use this Netty version or write your own handler that does something similar.
However, to use the Netty version you will need to refactor slightly to produce a Netty SslContext (io.netty.handler.ssl.SslContext) instead of an SSLEngine.
Something like this:
char[] password = "your5663".toCharArray();

KeyStore ks = KeyStore.getInstance("JKS");
ks.load(new FileInputStream("C:/OpenSSL-Win64/bin/keystore1.jks"), password);

KeyManagerFactory kmf = KeyManagerFactory.getInstance("SunX509");
kmf.init(ks, password);

// Build the Netty SslContext from the KeyManagerFactory instead of a JDK SSLEngine
SslContext sslContext = SslContextBuilder.forServer(kmf).build();

// OptionalSslHandler inspects the first bytes and only engages SSL when the
// connection is actually encrypted; plaintext connections fall through.
ch.pipeline().addLast("ssl", new OptionalSslHandler(sslContext));
// this is an imaginary handler you create that sends HTTP a 301 to HTTPS
// non-SSL can be detected easily because there is no SslHandler on this channel
ch.pipeline().addLast("redirectHandler", new RedirectHandler());
ch.pipeline().addLast("handshake", new StringEventHandler());

Sending a .gz file via curl to a RESTful PUT creates a ZipException in GZIPInputStream

The application I am creating takes a gzipped file sent to a RESTful PUT, unzips the file and then does further processing like so:
public class Service {
    @PUT
    @Path("/{filename}")
    public Response doPut(@Context HttpServletRequest request,
                          @PathParam("filename") String filename,
                          InputStream inputStream) {
        try {
            GZIPInputStream gzipInputStream = new GZIPInputStream(inputStream);
            // Do Stuff with GZIPInputStream
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
I am able to successfully send a gzipped file in a unit test like so:
InputStream inputStream = new FileInputStream("src/main/resources/testFile.gz");
Service service = new Service();
service.doPut(mockHttpServletRequest, "testFile.gz", inputStream);
// Verify processing stuff happens
But when I build the application and attempt to curl the same file from the src/main/resources dir with the following, I get a ZipException:
curl -v -k -X PUT --user USER:Password -H "Content-Type: application/gzip" --data-binary @testFile.gz https://myapp.dev.com/testFile.gz
The exception is:
java.util.zip.ZipException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at Service.doPut(Service.java:23)
// etc.
So does anyone have any idea why sending the file via CURL causes the ZipException?
Update:
I ended up taking a look at the actual bytes being sent via the InputStream and figured out where the ZipException: Not in GZIP format error was coming from. The first two bytes of a GZIP file are required to be 1F and 8B respectively in order for GZIPInputStream to recognize the data as being in GZIP format. Instead, the 8B byte, along with every other byte in the stream that doesn't correspond to a valid UTF-8 character, was transformed into the bytes EF, BF, BD, which are the UTF-8 unknown-character replacement bytes. Thus the server is reading the GZIP data as UTF-8 characters rather than as binary, corrupting the data.
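A quick way to confirm this kind of corruption on the server side is to peek at the first two bytes before handing the stream to GZIPInputStream (a diagnostic sketch, not part of the service code above; 0x1F 0x8B are the GZIP magic bytes):

// Diagnostic sketch: verify the GZIP magic bytes (0x1F 0x8B) are intact.
// If the payload was decoded as UTF-8 somewhere along the way, 0x8B will have
// been replaced by the EF BF BD sequence and this check will fail.
PushbackInputStream peekable = new PushbackInputStream(inputStream, 2);
int b1 = peekable.read();
int b2 = peekable.read();
if (b1 != 0x1F || b2 != 0x8B) {
    throw new IOException(String.format(
            "Not a GZIP stream, first bytes: %02X %02X", b1, b2));
}
peekable.unread(b2);
peekable.unread(b1);
GZIPInputStream gzipInputStream = new GZIPInputStream(peekable);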
The issue I am having now is I can't figure out where I need to change the configuration in order to get the server to treat the compressed data as binary vs UTF-8. The application uses Jax-rs on a Jersey server using Spring-Boot that is deployed in a Kubernetes pod and ran as a service, so something in the setup of one of those technologies needs to be tweaked to prevent improper encoding from being used on the data.
I have tried adding -H "Content-Encoding: gzip" to the curl command, registering the EncodingFilter.class and GZipEncoder.class in the Jersey ResourceConfig class, adding application/gzip to the server.compression.mime-types in application.properties, adding the @Consumes("application/gzip") annotation to the doPut method above, and several other things I can't remember off the top of my head, but nothing seems to have any effect.
I am seeing the following in the verbose CURL logs:
> PUT /src/main/resources/testFile.gz
> HOST: my.host.com
> Authorization: Basic <authorization stuff>
> User-Agent: curl/7.54.1
> Accept: */*
> Content-Encoding: gzip
> Content-Type: application/gzip
> Content-Length: 31
>
} [31 bytes data]
* upload completely sent off: 31 out of 31 bytes
< HTTP/1.1 500
< X-Application-Context: application
< Content-Type: application/json;charset=UTF-8
< Transfer-Encoding: chunked
< Date: <date stuff>
...etc
Nothing I have done has affected the receiving side
Content-Type: application/json;charset=UTF-8
portion, which I suspect is the issue.
I ran into the same problem and finally solved it by using -H 'Content-Type:application/json;charset=UTF-8'
Use Charles to find the difference
I can successfully send the gzipped file using Postman, so I used Charles to capture the requests sent by curl and Postman respectively. After comparing the two, I found that Postman used application/json as the content type while curl used text/plain.
Spring docs: Content Type and Transformation
According to the Spring docs, if the content type is text/plain and the source payload is byte[], Spring will convert the payload to a String using the charset specified in the content-type header. That's why the ZipException occurred: the original byte data had already been decoded and was no longer in gzip format.
Spring source code
@Override
protected Object convertFromInternal(Message<?> message, Class<?> targetClass, @Nullable Object conversionHint) {
    Charset charset = getContentTypeCharset(getMimeType(message.getHeaders()));
    Object payload = message.getPayload();
    return (payload instanceof String ? payload : new String((byte[]) payload, charset));
}

Spring Cloud Stream w/Kafka + Confluent Schema Registry Client broken?

Curious if anyone has got this working as I'm currently struggling.
I have created simple Source and Sink applications to send and receive an Avro schema based message. The schema for the message is held in a Confluent Schema Registry. Both apps are configured to use the ConfluentSchemaRegistryClient class but I think there might be a bug in here somewhere. Here's what I see that makes me wonder.
If I interact with the Confluent registry's REST API I can see that there is only one version of the schema in question (lightly edited to obscure what I'm working on):
$ curl -i "http://schemaregistry:8081/subjects/somesubject/versions"
HTTP/1.1 200 OK
Date: Fri, 05 May 2017 16:13:37 GMT
Content-Type: application/vnd.schemaregistry.v1+json
Content-Length: 3
Server: Jetty(9.2.12.v20150709)
[1]
When the Source app sends off its message over Kafka I noticed that the version in the header looked a bit funky:
contentType"application/octet-stream"originalContentType/"application/vnd.somesubject.v845+avro"
I'm not 100% clear about why the application/vnd.somesubject.v845+avro content type is wrapped up in application/octet-stream, but ignoring that, note that it is saying version 845, not version 1.
Looking at the ConfluentSchemaRegistryClient implementation I see that it POSTs to /subjects/(string: subject)/versions and returns the id of the schema not the version. This then gets put into SchemaReference's version field: https://github.com/spring-cloud/spring-cloud-stream/blob/master/spring-cloud-stream-schema/src/main/java/org/springframework/cloud/stream/schema/client/ConfluentSchemaRegistryClient.java#L81
When the Sink app tries to fetch the schema for the message based upon the header it fails because it tries to fetch version 845 that its plucked out of the header: https://github.com/spring-cloud/spring-cloud-stream/blob/master/spring-cloud-stream-schema/src/main/java/org/springframework/cloud/stream/schema/client/ConfluentSchemaRegistryClient.java#L87
Anyone have thoughts on this? Thanks in advance.
** UPDATE **
OK, I'm pretty convinced this is a bug. I took the ConfluentSchemaRegistryClient and modified the register method slightly to POST to /subjects/(string: subject) (i.e. I dropped the trailing /versions), which per the Confluent REST API docs returns a payload with the version in it. Works like a charm:
public SchemaRegistrationResponse register(String subject, String format, String schema) {
    Assert.isTrue("avro".equals(format), "Only Avro is supported");
    String path = String.format("/subjects/%s", subject);
    HttpHeaders headers = new HttpHeaders();
    headers.put("Accept",
            Arrays.asList("application/vnd.schemaregistry.v1+json", "application/vnd.schemaregistry+json",
                    "application/json"));
    headers.add("Content-Type", "application/json");
    Integer version = null;
    try {
        String payload = this.mapper.writeValueAsString(Collections.singletonMap("schema", schema));
        HttpEntity<String> request = new HttpEntity<>(payload, headers);
        ResponseEntity<Map> response = this.template.exchange(this.endpoint + path, HttpMethod.POST, request,
                Map.class);
        version = (Integer) response.getBody().get("version");
    }
    catch (JsonProcessingException e) {
        e.printStackTrace();
    }
    SchemaRegistrationResponse schemaRegistrationResponse = new SchemaRegistrationResponse();
    schemaRegistrationResponse.setId(version);
    schemaRegistrationResponse.setSchemaReference(new SchemaReference(subject, version, "avro"));
    return schemaRegistrationResponse;
}

What can cause data written to a TCP socket on Windows to be altered before transmission?

Something is altering data written to TCP sockets on my Windows (Windows 7) machine - specifically, when the bytes follow a specific HTTP POST pattern, the pattern is repeated when the bytes are read from the corresponding listener socket side of the connection.
The following bytes are written to the client socket (note: each line ends with a carriage-return and newline and the two nonblank lines are followed by two blank lines):
POST / HTTP/1.1
Transfer-Encoding: chunked
what is read from the listener socket is:
POST / HTTP/1.1
Transfer-Encoding: chunked
POST / HTTP/1.1
Transfer-Encoding: chunked
I've tested this on the loopback (127.0.0.1) address on my machine, but I've also seen the modified bytes when the listener socket was on another machine, so it appears the bytes are modified on the client side. I've reproduced the problem using both netcat and a Java program (see below) on my machine, so the issue appears to be in the TCP stack. I've only been able to cause it with a specific set of HTTP headers, so it appears that something is doing deep packet inspection on my TCP communication and altering it. If I alter the input bytes slightly so that it is no longer a valid HTTP request (for instance by changing "POST" to "QOST"), it works fine.
Below is a Java program I've written that demonstrates this, along with its output:
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.Charset;

public final class Main {

    private static final String PAYLOAD
            = "POST / HTTP/1.1\r\n"
            + "Transfer-Encoding: chunked\r\n"
            + "\r\n"
            + "\r\n";

    private static final int PORT = 8080;

    public static final void main(final String[] args) throws Exception {
        final Thread serverThread = new Thread(new Server());
        serverThread.start();
        final byte[] payloadBytes = PAYLOAD.getBytes(Charset.forName("UTF-8"));
        int i = 0;
        try (final Socket socket = new Socket(InetAddress.getLoopbackAddress(), PORT)) {
            socket.setTcpNoDelay(true);
            final OutputStream os = socket.getOutputStream();
            for (final byte byteValue : payloadBytes) {
                os.write(byteValue);
                os.flush();
                i++;
            }
        }
        serverThread.join();
        System.out.println("bytes written: " + i);
    }

    private static final class Server implements Runnable {

        @Override
        public void run() {
            try (final ServerSocket serverSocket = new ServerSocket(PORT)) {
                // while (true) {
                final Socket socket = serverSocket.accept();
                socket.setTcpNoDelay(true);
                try (final InputStream is = socket.getInputStream()) {
                    int i = 0;
                    int byteValue;
                    while ((byteValue = is.read()) >= 0) {
                        System.out.print((char) byteValue);
                        System.out.flush();
                        i++;
                    }
                    System.out.println("----------------");
                    System.out.println("bytes read: " + i);
                }
                // }
            } catch (final Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
output:
POST / HTTP/1.1
Transfer-Encoding: chunked
POST / HTTP/1.1
Transfer-Encoding: chunked
----------------
bytes read: 96
bytes written: 49
Below is the same test using netcat (nc.exe from cygwin) on windows (note: the file test_payload.blob contains the bytes as described above and derived from the PAYLOAD constant in the java program):
start the nc listener:
nc -l 8080 > nc_capture; more nc_capture
run the nc client (in another shell from the listener):
nc -v 127.0.0.1 8080 < test_payload.blob
the output written to nc_capture:
POST / HTTP/1.1
Transfer-Encoding: chunked
POST / HTTP/1.1
Transfer-Encoding: chunked
My first thought was a buggy firewall, so I disabled it, but it still happens. I also tried resetting my winsock and tcp/ip, and it still happens. I tried disabling all of my network adapters (the above tests work on the loopback IP address, so they are not needed), and it still happens. At this point, I am pretty much out of ideas, and I don't even know how I would go about trying to debug this at a lower level. Has anyone ever seen something like this before? Is there some low level diagnostic tool on windows that I can use to see what might have its hooks into my TCP stack?
