AWS multipart upload from inputStream has bad offset - hadoop

I am using the Java Amazon AWS SDK to perform some multipart uploads from HDFS to S3. My code is the following:
for (int i = startingPart; currentFilePosition < contentLength; i++)
{
    FSDataInputStream inputStream = fs.open(new Path(hdfsFullPath));
    // Last part can be less than 5 MB. Adjust part size.
    partSize = Math.min(partSize, (contentLength - currentFilePosition));
    // Create request to upload a part.
    UploadPartRequest uploadRequest = new UploadPartRequest()
        .withBucketName(bucket).withKey(s3Name)
        .withUploadId(currentUploadId)
        .withPartNumber(i)
        .withFileOffset(currentFilePosition)
        .withInputStream(inputStream)
        .withPartSize(partSize);
    // Upload part and add response to our list.
    partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
    currentFilePosition += partSize;
    inputStream.close();
    lastFilePosition = currentFilePosition;
}
However, the uploaded file is not the same as the original one. More specifically, I am testing with a file of about 20 MB. The parts I upload are 5 MB each. At the end of each 5 MB part, I see some extra text, which is always 96 characters long.
Even stranger, if I add something stupid to .withFileOffset(), for example,
.withFileOffset(currentFilePosition-34)
the error stays the same. I was expecting to get other characters, but I am getting the EXACT 96 extra characters as if I hadn't modified the line.
Any ideas what might be wrong?
Thanks,
Serban

I figured it out. This came from a wrong assumption on my part. It turns out that the offset passed to ".withFileOffset(...)" says nothing about where an input stream is read from; it is documented as the offset into a file supplied with ".withFile(...)", and it has no effect on a stream supplied with ".withInputStream(...)". Because I open and close the stream on every iteration, each part is read from the beginning of the file, which is also why changing the offset made no difference. The solution is to add a seek statement after opening the stream:
FSDataInputStream inputStream = fs.open(new Path(hdfsFullPath));
inputStream.seek(currentFilePosition);
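To illustrate why the seek matters, here is a minimal sketch that uses only the JDK instead of HDFS and the AWS SDK. The PartReader class and its readPart method are hypothetical names made up for this example. It shows that a freshly opened source starts at position 0 unless it is explicitly repositioned before the part is read.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of Hadoop or the AWS SDK.
final class PartReader {

    // Returns up to partSize bytes of the file, starting at offset.
    static byte[] readPart(Path file, long offset, int partSize) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
            // Without this call the channel starts at position 0,
            // so every part would contain the beginning of the file.
            channel.position(offset);
            ByteBuffer buffer = ByteBuffer.allocate(partSize);
            // A single read may return fewer bytes than requested, so loop
            // until the buffer is full or the end of the file is reached.
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // nothing else to do per iteration
            }
            buffer.flip();
            byte[] part = new byte[buffer.remaining()];
            buffer.get(part);
            return part;
        }
    }
}
```

In the HDFS loop from the question, inputStream.seek(currentFilePosition) plays the role that channel.position(offset) plays here.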

Related

How do I write binary data to GCS with Ruby efficiently?

I want to upload binary data directly to Google Cloud Storage, without writing the file to disk. Below is the code I have so far.
require 'google/cloud/storage'
bucket_name = '-----'
data = File.open('image_block.jpg', 'rb') {|file| file.read }
storage = Google::Cloud::Storage.new("project_id": "maybe-i-will-tell-u")
bucket = storage.bucket bucket_name, skip_lookup: true
Now I want to put this data directly into a file on GCS, without having to write a file to disk.
Is there an efficient way we can do that?
I tried the following code
to_send = StringIO.new(data).read
bucket.create_file to_send, "image_inder_11111.jpg"
but this throws an error saying
/google/cloud/storage/bucket.rb:2898:in `file?': path name contains null byte (ArgumentError)
from /home/inder/.gem/gems/google-cloud-storage-1.36.1/lib/google/cloud/storage/bucket.rb:2898:in `ensure_io_or_file_exists!'
from /home/inder/.gem/gems/google-cloud-storage-1.36.1/lib/google/cloud/storage/bucket.rb:1566:in `create_file'
from champa.rb:14:in `<main>'
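As suggested by @stefan, it should be to_send = StringIO.new(data), i.e. without .read (which would return a string again). create_file treats a String argument as the path of a local file, which is why passing the raw bytes fails with "path name contains null byte".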

Video - stream slow using s3 content store

I am streaming video through an S3 content store. It works, but it is slow for larger files.
How can I make the S3 content store deliver the content faster?
I have tried returning a byte array and copying to a buffer. Everything loads, just slowly, and I am not sure where the bottleneck is.
Optional<File> f = filesRepo.findById(id);
if (f.isPresent()) {
InputStreamResource inputStreamResource = new InputStreamResource(contentStore.getContent(f.get()));
HttpHeaders headers = new HttpHeaders();
headers.setContentLength(f.get().getContentLength());
headers.set("Content-Type", f.get().getMimeType());
return new ResponseEntity<Object>(inputStreamResource, headers, HttpStatus.OK);
}
I also get this warning:
Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
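The warning itself points at ranged GETs. As a rough sketch of what that involves on the serving side, the hypothetical ByteRange class below resolves a single-range HTTP Range header against the object length. Everything here is an assumption for illustration: the class and method names are invented, only one range per header is handled, and whether the content store in use can fetch a byte range from S3 is not something the question establishes.

```java
// Hypothetical helper for illustration; names are not from any library.
final class ByteRange {
    final long start; // first byte, inclusive
    final long end;   // last byte, inclusive

    ByteRange(long start, long end) {
        this.start = start;
        this.end = end;
    }

    // Resolves headers such as "bytes=0-499", "bytes=500-" or "bytes=-500".
    // Returns null when the header is missing, malformed, lists several
    // ranges, or cannot be satisfied for the given content length.
    static ByteRange parse(String header, long contentLength) {
        if (header == null || !header.startsWith("bytes=") || contentLength <= 0) {
            return null;
        }
        String spec = header.substring("bytes=".length()).trim();
        int dash = spec.indexOf('-');
        if (dash < 0 || spec.indexOf(',') >= 0) {
            return null;
        }
        String first = spec.substring(0, dash).trim();
        String last = spec.substring(dash + 1).trim();
        try {
            long start;
            long end;
            if (first.isEmpty()) {
                // Suffix form: the final N bytes of the object.
                if (last.isEmpty()) {
                    return null;
                }
                long suffix = Long.parseLong(last);
                if (suffix <= 0) {
                    return null;
                }
                start = Math.max(0, contentLength - suffix);
                end = contentLength - 1;
            } else {
                start = Long.parseLong(first);
                end = last.isEmpty()
                        ? contentLength - 1
                        : Math.min(Long.parseLong(last), contentLength - 1);
            }
            return (start < 0 || start > end) ? null : new ByteRange(start, end);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

A controller could answer such a request with status 206 and only the requested bytes, so the S3 stream is read to its end instead of being abandoned part way through.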

How to make a downloader in java

I am making a downloader in java to download small to large files.
My logic to download files is as follows
URL url=new URL(urlToGetFile);
int count=-1; //this is for counter
int offset=0;
BufferedInputStream bufferedInputStream=new BufferedInputStream(url.openStream());
FileOutputStream fileOutputStream=new FileOutputStream(FinalFilePath);
byte data[] = new byte[1024];
while( ((count=bufferedInputStream.read(data,0,1024))!=-1) )
{
fileOutputStream.write(data,0, 1024);
}
bufferedInputStream.close();
fileOutputStream.close();
PrintLine("File has download");
It works for small files, but large files are downloaded corrupted.
After reading many questions I am also a little confused about why everyone writes fileOutputStream.write(data,0, 1024); with the offset set to 0, and likewise for the offset in bufferedInputStream.read.
I also want to know how to change that offset for BufferedInputStream and for FileOutputStream while reading bytes in the loop.
You need to write the amount that was read.
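When you read into the buffer you can get fewer than 1024 bytes. For example a 1200-byte file would be read as 1024 + 176. Your count variable stores how much was actually read, which would be 176 the second time around your loop. On a network stream a short read can also happen in the middle of the download, not only at the end, which would explain why large files are hit more often.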
The reason for corruption is that you would be writing 176 'good' bytes plus (1024 - 176 = 848) additional bytes that were still in the data array from the previous read.
So try:
while ((count = bufferedInputStream.read(data, 0, 1024)) != -1)
{
    fileOutputStream.write(data, 0, count);
}
The zero offset in that write call is an offset into data, which you really do want to be zero. See the Javadoc for details. There is no difference for other stream types.
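Putting the fix together, a minimal self-contained sketch of the copy loop could look like the following. The StreamCopy class name is made up for this example; the loop itself is the one from the answer, with the buffer length taken from the array instead of being repeated as a literal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration.
final class StreamCopy {

    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] data = new byte[1024];
        long total = 0;
        int count;
        // read() returns how many bytes it actually placed in data, or -1 at the end.
        while ((count = in.read(data, 0, data.length)) != -1) {
            // Write only the bytes that were just read, starting at index 0 of data.
            out.write(data, 0, count);
            total += count;
        }
        return total;
    }
}
```

For the downloader it would be called with url.openStream() and a FileOutputStream, ideally both opened in a try-with-resources statement so they are closed even when the download fails.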

Stream a HTTP response in Java

I want to write the response to an HTTP request to a file. However, I want to stream the response to a physical file without waiting for the entire response to be loaded.
I will actually be making a request to a JHAT server to return all the Strings from the dump. My browser hangs before the response completes because there are 70k such objects, so I want to write them to a file that I can scan through.
thanks in advance,
Read a limited amount of data from the HTTP stream and write it to a file stream. Do this until all data has been handled.
Here is example code demonstrating the principle. In this example I do not deal with any i/o errors. I chose an 8KB buffer to be faster than processing one byte at a time, yet still limiting the amount of data pulled into RAM during each iteration.
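import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

final URL url = new URL("http://example.com/");
// try-with-resources closes both streams, even if a read or write fails
try (InputStream istream = url.openStream();
     OutputStream ostream = new FileOutputStream("/tmp/data.txt")) {
    final byte[] buffer = new byte[1024 * 8];
    int len;
    // read() returns -1 once the end of the stream is reached
    while ((len = istream.read(buffer)) != -1) {
        ostream.write(buffer, 0, len);
    }
}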
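As a shorter alternative, and assuming Java 7 or later, the JDK's Files.copy can do the buffered copy itself. The ResponseSaver class below is a hypothetical wrapper written for this example. It still streams the data in chunks and does not load the whole response into memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration.
final class ResponseSaver {

    // Streams in to target, replacing any existing file, and returns the bytes written.
    // The caller remains responsible for closing in.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With the URL from the example above, the input stream would come from url.openStream() and the target from Paths.get("/tmp/data.txt").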

Uploading large files to S3 with ruby (aws:s3) - connection reset by peer on UBUNTU

I am trying to store some large files on S3 using ruby aws:s3 using:
S3Object.store("video.mp4", open(file), 'bucket', :access => :public_read)
For files of 100 MB or so everything is great but with files of over 200 MB I get a "Connection reset by peer" error in the log.
Has anyone come across this weirdness? From the web, it seems to be an issue with large files, but I have not yet come across a definitive solution.
I am using Ubuntu.
EDIT:
This seems to be a Linux issue as suggested here.
No idea where the original problem might be, but as a workaround you could try a multipart upload.
require 'digest'

filename = "video.mp4"
min_chunk_size = 5 * 1024 * 1024 # S3 minimum part size (5 MB)
# @object is assumed to be the S3 object being written
@object.multipart_upload do |upload|
  io = File.open(filename, 'rb')
  parts = []
  bufsize = (io.size > 2 * min_chunk_size) ? min_chunk_size : io.size
  while buf = io.read(bufsize)
    md5 = Digest::MD5.base64digest(buf) # computed here but not used further
    part = upload.add_part(buf)
    parts << part
    # Fold a short tail into the final read so that no trailing part
    # falls below the minimum size
    if (io.size - (io.pos + bufsize)) < bufsize
      bufsize = (io.size - io.pos) if (io.size - io.pos) > 0
    end
  end
  upload.complete(parts)
end
S3 multipart upload is a little tricky because every part except the last must be at least 5 MB, but the code above takes care of that.
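The part-size rule can also be checked up front, before any data is sent. The sketch below is written in Java to match the other examples on this page, and the PartPlanner class is a hypothetical name. It splits a content length into offset and size pairs and rejects a part size below the 5 MiB minimum, so only the final part can be smaller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any AWS SDK.
final class PartPlanner {

    // S3 requires every part except the last to be at least 5 MiB.
    static final long MIN_PART_SIZE = 5L * 1024 * 1024;

    // Returns {offset, size} pairs that cover contentLength bytes in order.
    static List<long[]> plan(long contentLength, long partSize) {
        if (partSize < MIN_PART_SIZE) {
            throw new IllegalArgumentException("partSize must be at least " + MIN_PART_SIZE);
        }
        List<long[]> parts = new ArrayList<>();
        for (long offset = 0; offset < contentLength; offset += partSize) {
            // Only the final part can come out shorter than partSize.
            long size = Math.min(partSize, contentLength - offset);
            parts.add(new long[] {offset, size});
        }
        return parts;
    }
}
```

Each pair gives the position to seek to in the source and the number of bytes to send for that part.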
