Stream an HTTP response in Java

I want to write the response of an HTTP request to a file, but I want to stream the response to a physical file without waiting for the entire response to be loaded.
I will actually be making a request to a jhat server to return all the Strings from the heap dump. My browser hangs before the response completes, as there are 70k such objects, so I want to write them to a file that I can scan through.
Thanks in advance.

Read a limited amount of data from the HTTP stream and write it to a file stream. Do this until all data has been handled.
Here is example code demonstrating the principle. In this example I do not deal with any I/O errors. I chose an 8 KB buffer to be faster than processing one byte at a time, while still limiting the amount of data pulled into RAM during each iteration.
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

final URL url = new URL("http://example.com/");
// try-with-resources closes both streams, even if an exception is thrown
try (final InputStream istream = url.openStream();
     final OutputStream ostream = new FileOutputStream("/tmp/data.txt")) {
    final byte[] buffer = new byte[1024 * 8];
    while (true) {
        final int len = istream.read(buffer);
        if (len <= 0) {
            break;
        }
        ostream.write(buffer, 0, len);
    }
}

Related

Video stream slow using S3 content store

Video streaming works, however on larger files it is slow.
How can I improve the S3 content store to deliver the content faster?
I have tried returning a byte array and copying to a buffer. Everything loads, just slowly. I am not sure where the bottleneck is coming from.
Optional<File> f = filesRepo.findById(id);
if (f.isPresent()) {
    InputStreamResource inputStreamResource =
            new InputStreamResource(contentStore.getContent(f.get()));
    HttpHeaders headers = new HttpHeaders();
    headers.setContentLength(f.get().getContentLength());
    headers.set("Content-Type", f.get().getMimeType());
    return new ResponseEntity<Object>(inputStreamResource, headers, HttpStatus.OK);
}
I also get this warning:
Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
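One thing worth trying, sketched here on my own (this is not code from the post): return a StreamingResponseBody so the servlet container copies the S3 stream to the client in chunks instead of materializing it; filesRepo, contentStore and the File entity are assumed from the question's code.
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Optional;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody;

// inside the controller method, reusing the question's filesRepo and contentStore
Optional<File> f = filesRepo.findById(id);
if (f.isPresent()) {
    InputStream content = contentStore.getContent(f.get());
    StreamingResponseBody body = (OutputStream out) -> {
        // copy in chunks; fully reading and closing the stream lets the S3 client
        // release the connection, which should also silence the warning above
        try (InputStream in = content) {
            byte[] buffer = new byte[8192];
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
        }
    };
    HttpHeaders headers = new HttpHeaders();
    headers.setContentLength(f.get().getContentLength());
    headers.set(HttpHeaders.CONTENT_TYPE, f.get().getMimeType());
    return new ResponseEntity<>(body, headers, HttpStatus.OK);
}
Whether this removes the bottleneck depends on where the time actually goes (S3 latency, network, or the content store), so treat it as one experiment rather than a guaranteed fix.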

Spark Streaming: Micro batches Parallel Execution

We are receiving data in Spark Streaming from Kafka. Once execution starts, Spark Streaming executes only one batch and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We have tried multiple configurations (more executors, cores, back pressure, and so on) but nothing has worked so far. A lot of messages are queued, only one micro batch is processed at a time, and the rest remain in the queue.
We want to achieve maximum parallelism so that no micro batch is queued, as we have enough resources available. How can we reduce the processing time by fully utilizing the resources?
// Start reading messages from Kafka and get a DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
        getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe(Collections.singletonList("TOPIC_NAME"),
                sparkServiceConf.getKafkaConsumeParams()));

ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());

// Extract the raw payload from each Kafka record
JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
    private static final long serialVersionUID = 1L;

    @Override
    public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
        return kafkaRecord.value();
    }
});

// Decode (gunzip) each binary message and generate the JSON
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
    private static final long serialVersionUID = 1L;

    @Override
    public String call(byte[] asn1Data) throws Exception {
        if (asn1Data.length > 0) {
            try (ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
                 GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
                 ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
                byte[] buffer = new byte[1024];
                int len;
                while ((len = gzipInputStream.read(buffer)) != -1) {
                    byteArrayOutputStream.write(buffer, 0, len);
                }
                return new String(byteArrayOutputStream.toByteArray());
            } catch (Exception e) {
                producer.flush();
                throw e;
            }
        }
        return null;
    }
});

// publish generated JSON gzip to Kafka
// ("cache" is the cached decoded stream; the caching call is not shown in this excerpt)
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
        if (!jsonRdd4DF.isEmpty()) {
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            airMainJsonProcessor.processAIRData(json, sparkSession);
        }
    }
});

getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technologies we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics we got from different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can maximize processing so that no batches remain queued.
I was facing the same issue, tried a few things to resolve it, and came to the following findings:
First of all, intuition says that one batch should be processed per executor, but on the contrary only one batch is processed at a time, while its jobs and tasks are processed in parallel.
Multiple batch processing can be achieved with spark.streaming.concurrentJobs, but it is not documented and still needs a few fixes. One of the problems is with saving Kafka offsets: suppose we set this parameter to 4 and four batches are processed in parallel; if the 3rd batch finishes before the 4th one, which Kafka offsets would be committed? This parameter is quite useful if batches are independent.
spark.default.parallelism is, because of its name, sometimes assumed to make things parallel, but its true benefit lies in distributed shuffle operations. Try different numbers and find the optimum for your jobs; you can get a considerable difference in processing time depending on the shuffle operations involved. Setting it too high decreases performance, which is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach on the RDD, but I think foreachPartition is better, because foreachPartitionAsync queues up the jobs: batches appear to be processed while their jobs are still queued or running. Maybe I didn't get its usage right, but it behaved the same in my three services.
FAIR spark.scheduler.mode should be used for jobs with lots of tasks, as its round-robin assignment of resources to jobs gives smaller tasks a chance to start while bigger tasks are still running (see the configuration sketch after these points).
Try to tune your batch duration and input size, and always keep the processing time below the batch duration; otherwise you will see a long backlog of batches.
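For concreteness, here is a minimal configuration sketch of my own (not from the original answer) showing how the options discussed above might be supplied when building the streaming context; the values are placeholders to tune:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
        .setAppName("streaming-tuning-sketch")                 // hypothetical app name
        .set("spark.streaming.concurrentJobs", "4")            // undocumented: run the jobs of several batches at once
        .set("spark.scheduler.mode", "FAIR")                   // round-robin resource assignment across jobs
        .set("spark.default.parallelism", "12")                // tune per the experiments above
        .set("spark.streaming.backpressure.enabled", "true");  // let Spark throttle the ingestion rate

// keep the batch interval above the typical processing time
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
Keep the offset-commit caveat for spark.streaming.concurrentJobs in mind: with independent batches it can help, but offsets may be committed out of order.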
These are my findings and suggestions; however, there are many configurations and methods for streaming, and often one set of settings doesn't work for another workload. Spark Streaming is all about learning, putting your experience and anticipation together to arrive at an optimum configuration.
Hope it helps. It would be a great relief if someone could explain specifically how we can legitimately process batches in parallel.
We want to achieve maximum parallelism so that no micro batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data at a rate slower than it arrives, it will be queued. Also, don't expect that the processing of one record will suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36s to process 2 records and 70s to process 17 records. Clearly, there is some per-batch overhead. If this dependency is linear, it would take only 4:18 to process all 100 records in a single mini-batch, thus beating your record holder.
Since your code is not complete, it's hard to tell what exactly takes so much time. The transformations in the code look fine, but probably the action (or subsequent transformations) is the real bottleneck. Also, what is producer.flush() doing there? It isn't mentioned anywhere else in your code.
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

messages.foreachRDD { rdd =>
  val f = Future {
    val newRDD = rdd.map { message =>
      message.value()
    }
    println("Request messages: " + newRDD.count())
    val resultrows = newRDD.collect()
    processMessage(resultrows, mlFeatures: MLFeatures, conf)
    println("Inside scala future")
    1 // the Future needs a result value
  }
  f.onComplete {
    case Success(messages)  => println("yay!")
    case Failure(exception) => println("Oh no!")
  }
}
It's hard to tell without having all the details, but general advice for tackling issues like this: start with a very simple, "Hello world" kind of application. Just read from the input stream and print the data to a log file. Once this works, you have proven that the problem is in your application, and you can gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know the problem is in the configuration or the Spark cluster itself. Hope this helps. A sketch of such a minimal app is shown below.
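For illustration, a minimal sketch of such a "hello world" app against the Kafka 0.10 direct-stream API; the broker address, topic name, and group id are placeholders:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public final class HelloStreaming {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("hello-streaming");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");     // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "hello-streaming");             // placeholder
        kafkaParams.put("auto.offset.reset", "latest");

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream = KafkaUtils.createDirectStream(
                jssc, LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(
                        Collections.singletonList("TOPIC_NAME"), kafkaParams));

        // Do the bare minimum: log how many records each batch carried.
        stream.foreachRDD(rdd -> System.out.println("batch size: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}
If this runs and logs batch counts at the expected rate, the cluster and configuration are fine and the slowdown is in the application code.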

CAPL Multiframe handling

I am writing a CAPL script for diagnostic request and response. I can get a response if the data is up to 8 bytes; if the data is multiframe I am not getting a response, and the message on the trace is "Breaking connection between server and tester". How do I handle this? I know about the CAN TP frames, but in this case it should be handled by CANoe.
Please read up on the ISO-TP protocol in CANoe. In case of a multiframe response, the tester has to send a flow control frame (usually starting with 0x30), which provides synchronization between sender and receiver. It also has fields for the block size of consecutive frames and the separation time. Try the CAPL code below.
variables
{
  message 0x710 msg = { dlc = 8, dir = rx };
  byte check_byte0;
}

on message 0x718
{
  // detect the first frame of a multiframe response
  check_byte0 = this.byte(0) & 0x30;
  if (check_byte0 == 0x10)
  {
    // answer with a flow control frame (0x30 = continue to send)
    msg.dword(0) = 0x30;
    msg.dword(4) = 0x00;
    output(msg);
  }
}
I was trying to send the request over a message ID in raw form, like 22 XX YY (a read-DID request). This works well if the response is less than 8 bytes; if the response is more than 8 bytes it won't work, so we need to use the diagnostic objects for the request and response as defined in the CDD (or other description file) used in the project.
If you are not using a CDD, then you need to use CCI (CAPL callback interfaces); that is mostly necessary for simulation setups.

AWS multipart upload from InputStream has bad offset

I am using the Java Amazon AWS SDK to perform some multipart uploads from HDFS to S3. My code is the following:
for (int i = startingPart; currentFilePosition < contentLength; i++) {
    FSDataInputStream inputStream = fs.open(new Path(hdfsFullPath));
    // Last part can be less than 5 MB. Adjust part size.
    partSize = Math.min(partSize, (contentLength - currentFilePosition));
    // Create request to upload a part.
    UploadPartRequest uploadRequest = new UploadPartRequest()
            .withBucketName(bucket).withKey(s3Name)
            .withUploadId(currentUploadId)
            .withPartNumber(i)
            .withFileOffset(currentFilePosition)
            .withInputStream(inputStream)
            .withPartSize(partSize);
    // Upload part and add response to our list.
    partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
    currentFilePosition += partSize;
    inputStream.close();
    lastFilePosition = currentFilePosition;
}
However, the uploaded file is not the same as the original one. More specifically, I am testing with a file of about 20 MB. The parts I upload are 5 MB each. At the end of each 5 MB part, I see some extra text, which is always 96 characters long.
Even stranger, if I add something nonsensical to .withFileOffset(), for example,
.withFileOffset(currentFilePosition-34)
the error stays the same. I was expecting to get different characters, but I am getting the EXACT same 96 extra characters, as if I hadn't modified the line.
Any ideas what might be wrong?
Thanks,
Serban
I figured it out. It came from a wrong assumption on my part. It turns out the file offset in .withFileOffset(...) tells you the offset at which to write in the destination file; it doesn't say anything about the source. By opening and closing the stream repeatedly, I was always sending data from the beginning of the source file, just to a different destination offset. The solution is to add a seek statement after opening the stream:
FSDataInputStream inputStream = fs.open(new Path(hdfsFullPath));
inputStream.seek(currentFilePosition);
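Put together, the loop from the question with that seek applied (everything else unchanged) looks roughly like this:
for (int i = startingPart; currentFilePosition < contentLength; i++) {
    FSDataInputStream inputStream = fs.open(new Path(hdfsFullPath));
    inputStream.seek(currentFilePosition); // read this part from the correct position in the source file
    // Last part can be less than 5 MB. Adjust part size.
    partSize = Math.min(partSize, (contentLength - currentFilePosition));
    UploadPartRequest uploadRequest = new UploadPartRequest()
            .withBucketName(bucket).withKey(s3Name)
            .withUploadId(currentUploadId)
            .withPartNumber(i)
            .withFileOffset(currentFilePosition)
            .withInputStream(inputStream)
            .withPartSize(partSize);
    partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
    currentFilePosition += partSize;
    inputStream.close();
}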

Routing files with ZeroMQ (JeroMQ)

I'm trying to implement a "file dispatcher" on ZeroMQ (actually JeroMQ, since I'd rather avoid JNI).
What I need is to load balance incoming files to processors:
each file is handled only by one processor
files are potentially large so I need to manage the file transfer
Ideally I would like something like https://github.com/zeromq/filemq but
with a push/pull behaviour rather than publish/subscribe
being able to handle the received file rather than writing it to disk
My idea is to use a mix of taskvent/tasksink and asyncsrv samples.
Client side:
one PULL socket to be notified of a file to be processed
one DEALER socket to handle the (async) file transfer chunk by chunk
Server side:
one PUSH socket to dispatch incoming file (names)
one ROUTER socket to handle file requests
a few DEALER workers managing the file transfers for clients and connected to the router via an inproc proxy
My first question is: does this seem like the right way to go? Anything simpler maybe?
My second question is: my current implementation gets stuck on sending out the actual file data.
Clients are notified by the server and issue a request.
The server worker gets the request and writes the response back to the inproc queue, but the response never seems to go out of the server (I can't see it in Wireshark) and the client is stuck on poller.poll awaiting the response.
It's not a matter of sockets being full and dropping data; I'm starting with very small files sent in one go.
Any insight?
Thanks!
==================
Following raffian's advice I simplified my code, removing the extra PUSH/PULL socket (it does make sense now that you say it).
I'm still left with the non-working socket!
Here's my current code. It has many flaws that are out of scope for now (client ID, next chunk, etc.).
For now, I'm just trying to get both sides talking to each other, roughly in this sequence:
Server
object FileDispatcher extends App
{
  val context = ZMQ.context(1)

  // server is the frontend that pushes filenames to clients and receives requests
  val server = context.socket(ZMQ.ROUTER)
  server.bind("tcp://*:5565")

  // backend handles clients' requests
  val backend = context.socket(ZMQ.DEALER)
  backend.bind("inproc://backend")

  // files to dispatch given in arguments
  args.toList.foreach { filepath =>
    println(s"publish $filepath")
    server.send("newfile".getBytes(), ZMQ.SNDMORE)
    server.send(filepath.getBytes(), 0)
  }

  // multithreaded server: router hands out requests to DEALER workers via an inproc queue
  val NB_WORKERS = 1
  val workers = List.fill(NB_WORKERS)(new Thread(new ServerWorker(context)))
  workers foreach (_.start)

  ZMQ.proxy(server, backend, null)
}

class ServerWorker(ctx: ZMQ.Context) extends Runnable
{
  override def run()
  {
    val worker = ctx.socket(ZMQ.DEALER)
    worker.connect("inproc://backend")

    while (true)
    {
      val zmsg = ZMsg.recvMsg(worker)
      zmsg.pop // drop inner queue envelope (?)
      val cmd = zmsg.pop // cmd is used to continue/stop
      cmd.toString match {
        case "get" =>
          val file = zmsg.pop.toString
          println(s"clientReq: cmd: $cmd , file:$file")
          // 1- brute force: ignore cmd and send the full file in one go!
          worker.send("eof".getBytes, ZMQ.SNDMORE) // header indicates this is the last chunk
          val bytes = io.Source.fromFile(file).mkString("").getBytes // dirty read, for testing only!
          worker.send(bytes, 0)
          println(s"${bytes.size} bytes sent for $file: " + new String(bytes))
        case x => println("cmd " + x + " not implemented!")
      }
    }
  }
}
Client
object FileHandler extends App
{
  val context = ZMQ.context(1)

  // client is notified of new files, then fetches each file from the server
  val client = context.socket(ZMQ.DEALER)
  client.connect("tcp://*:5565")

  val poller = new ZMQ.Poller(1) // "poll" responses
  poller.register(client, ZMQ.Poller.POLLIN)

  while (true)
  {
    poller.poll
    val zmsg = ZMsg.recvMsg(client)
    val cmd = zmsg.pop
    val data = zmsg.pop
    // header is the command/action
    cmd.toString match {
      case "newfile" => startDownload(data.toString)                 // message content is the filename to fetch
      case "chunk"   => gotChunk(data.toString, zmsg.pop.getData)    // filename, chunk
      case "eof"     => endDownload(data.toString, zmsg.pop.getData) // filename, last chunk
    }
  }

  def startDownload(filename: String)
  {
    println("got notification: start download for " + filename)
    client.send("get".getBytes, ZMQ.SNDMORE) // command header
    client.send(filename.getBytes, 0)
  }

  def gotChunk(filename: String, bytes: Array[Byte])
  {
    println("got chunk for " + filename + ": " + new String(bytes)) // callback the user here
    client.send("next".getBytes, ZMQ.SNDMORE)
    client.send(filename.getBytes, 0)
  }

  def endDownload(filename: String, bytes: Array[Byte])
  {
    println("got eof for " + filename + ": " + new String(bytes)) // callback the user here
  }
}
On the client, you don't need PULL with DEALER.
DEALER combines PUSH and PULL, so use DEALER only; your code will be simpler.
The same goes for the server: unless you're doing something special, you don't need PUSH with ROUTER, since ROUTER is bidirectional.
The server worker gets the request and writes the response back to the inproc queue, but the response never seems to go out of the server (I can't see it in Wireshark) and the client is stuck on poller.poll awaiting the response.
Code Problems
In the server, you're dispatching files with args.toList.foreach before starting the proxy; this is probably why nothing is leaving the server. Start the proxy first, then use it. Also, once you call ZMQ.proxy(..), the call blocks indefinitely, so you'll need a separate thread to send the filepaths.
The client may have an issue with the poller. The typical pattern for polling is:
ZMQ.Poller items = new ZMQ.Poller(1);
items.register(receiver, ZMQ.Poller.POLLIN);
while (true) {
    items.poll(TIMEOUT);
    if (items.pollin(0)) {
        message = receiver.recv(0);
        // ... handle the message ...
    }
}
In the above code, we 1) poll until the timeout, 2) check for messages, and 3) if any are available, get them with receiver.recv(0). But in your code, you poll and then drop straight into recv() without checking. You need to check whether the poller has messages for the polled socket before calling recv(); otherwise the receiver will hang if there are no messages.
