I have a very large XML file on S3 (50 GB). I would like to stream this file to a SAX XML parser for further processing using Ruby. How would I do that in an environment where I cannot download the whole file locally, but can only stream it over TCP from S3?
I'm thinking about using https://github.com/ohler55/ox for the parsing itself, and https://github.com/aws/aws-sdk-ruby for accessing the file on S3. I'm just unsure how to connect the pieces using a streaming approach.
The easiest way is to use mc. mc implements a cat command which can be used in a simple way.
For example, as shown below: here cat streams your object, and its output is piped to your XML parser, which reads from standard input.
$ mc cat s3.amazonaws.com/<yourbucket>/<yourobject> | <your_xml_parser>
This way you can avoid downloading the file locally.
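From Ruby, you can wire the same pipe up with IO.popen, which hands mc's stdout to Ox as an IO. A minimal sketch; the handler is a stub and the mc path is the same placeholder as above:

require "ox"

# Stub SAX handler; fill in the Ox::Sax callbacks you need.
class MyHandler < Ox::Sax
  def start_element(name); end
  def text(value); end
end

# Run `mc cat` as a child process and parse straight from its stdout.
src = IO.popen(["mc", "cat", "s3.amazonaws.com/<yourbucket>/<yourobject>"], "r")
Ox.sax_parse(MyHandler.new, src)
src.close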
Additionally, mc provides more tools to work with Amazon S3-compatible cloud storage and filesystems. It has features like resumable uploads, a progress bar, and parallel copy. mc is written in Go and released under the Apache License v2. It is supported on OS X, Linux, and Windows.
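If you would rather stay entirely in Ruby, without shelling out to mc, the two gems from the question can be connected through a pipe: the block form of aws-sdk-s3's get_object yields each chunk as it arrives over the wire, and Ox parses from the read end. A minimal sketch; bucket, key, and the handler stub are placeholders:

require "aws-sdk-s3"
require "ox"

# Stub SAX handler; fill in the Ox::Sax callbacks you need.
class MyHandler < Ox::Sax
  def start_element(name); end
  def text(value); end
end

reader, writer = IO.pipe

# Producer thread: stream the object from S3 chunk by chunk, so the
# 50 GB file is never held in memory or written to disk.
producer = Thread.new do
  begin
    s3 = Aws::S3::Client.new
    s3.get_object(bucket: "<yourbucket>", key: "<yourobject>") do |chunk|
      writer.write(chunk)
    end
  ensure
    writer.close  # signals EOF to the parser
  end
end

# Consumer: Ox fires SAX callbacks as data flows through the pipe.
Ox.sax_parse(MyHandler.new, reader)
producer.join

The pipe's fixed-size buffer provides backpressure: the download blocks whenever the parser falls behind, so memory use stays flat.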
We want to download files from a remote URL into memory and then upload them to a public cloud. I am planning to use IO.copy_stream in Ruby. However, I am not sure whether this can be achieved with it, because I also need to keep memory and CPU usage low enough that it will not hamper performance.
Any suggestions or examples of how to achieve this via copy_stream in Ruby, or is there another library that achieves this with good performance?
https://ruby-doc.org/core-2.5.5/IO.html
You can set up src/dst to be simple IO abstractions that respond to read/write:
src = IO.popen(["ssh", srchost, "cat /path/to/source_file | gzip"], "r")
dst = IO.popen(["ssh", dsthost, "gunzip > /path/to/dest_file"], "w")
IO.copy_stream(src, dst)
src.close
dst.close
Set up src to be the downloadable file.
Set up dst to be the cloud resource, with write permission.
Make sure the two are compliant with sendfile().
sendfile() is a kernel-based copy-stream procedure. In terms of RAM use and performance, there is nothing faster: your application will not be involved in the transfer at all.
For sendfile(), the output socket must have zero-copy support and the input file must have mmap() support. In general, this means you have already downloaded the file to a local file, you do not change the downloaded file during the copy, and you have an open socket to the output.
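Applied to the remote-URL-to-cloud case from the question, where sendfile() won't kick in because the source is not a local file, IO.copy_stream still keeps memory flat by copying through a small fixed-size buffer. A minimal sketch, assuming aws-sdk-s3 as the destination cloud; the URL, bucket, and key are placeholders:

require "open-uri"
require "aws-sdk-s3"

# URI.open spools large responses to a Tempfile rather than RAM.
src = URI.open("https://example.com/source_file")
obj = Aws::S3::Object.new(bucket_name: "yourbucket", key: "dest_file")

# upload_stream yields a writable IO backed by a multipart upload;
# copy_stream moves the data across without loading it all at once.
obj.upload_stream do |dst|
  IO.copy_stream(src, dst)
end
src.close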
I have a third-party executable that takes a directory path as an argument and in turn looks there for a collection of .db files. I have said collection of files stored in a Google Cloud Storage bucket and would like to stream the content of those files into some local named pipes that can be used as input to the executable.
I'm writing an application to perform the above in Go and am using the "cloud.google.com/go/storage" package to work with cloud storage objects.
As a note, I need all pipes/files to be available for reading at the time I run the executable.
What is the best way to go about this? I'm essentially looking to use the named pipes as a proxy of sorts, to make the remote files look local to this executable. Is that possible?
I want to write a file in GCS from a stream object, but I've only found the create_file function, which creates a new file object given a path to a local file to upload and the path to store it at within the bucket.
Is there any function to create a file in GCS from a stream?
FUSE over GCS
You could try gcsfuse, which layers a user-space filesystem over a bucket, but it is only beta software at present. There's a nice section on limitations which you should read first.
I use fog to access GCS, but that is a thin layer which doesn't try to impose any additional semantics on the bucket/object model.
A warning: if your problem really requires a standard filesystem underneath any possible solution, then GCS is not a good fit.
The ability to provide an IO object instead of a File object has only recently been possible. It was added in PR 1335, and will be included in the next release.
Until then, the quickest way is to write the stream to a tempfile and upload that. For more, see Issue 305.
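Here is a minimal sketch of that workaround with the google-cloud-storage gem; the bucket and object names are placeholders, and your_stream stands for any readable IO you already have:

require "google/cloud/storage"
require "tempfile"

storage = Google::Cloud::Storage.new
bucket  = storage.bucket "<yourbucket>"

# Spool the stream to a Tempfile, then hand its path to create_file.
Tempfile.create("gcs-upload") do |tmp|
  tmp.binmode
  IO.copy_stream(your_stream, tmp)  # your_stream: any readable IO
  tmp.flush
  bucket.create_file tmp.path, "path/in/bucket"
end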
I know Spring has a MultipartFile component.
I am wondering if there is any API to unzip files, or to read zip files and do some processing?
I have a zip file that follows a certain format:
photos\
audio\
report.xml
When the user uploads it via the web, I wish to scan the zip file and do some processing.
Is there a solution for this issue?
I don't know of any such API in Spring, but you can use other APIs to zip or unzip files:
1) http://commons.apache.org/compress/
2) java.util.zip
and also see
What is a good Java library to zip/unzip files?
There are a couple of Java SE APIs for reading ZIP files:
java.util.zip.ZipInputStream - gives you a one-pass reader
java.util.zip.ZipFile - gives you a reader that allows you to read the entries and the files in any order.
You should be able to use one or the other of these, depending on the nature of your processing.
If the processing requires the images to be in actual files, you would have to create the directories and write the files yourself. In this case, it would probably be simpler to use an external command to do the ZIP extraction.
I am looking for a way to dynamically stream a zip download of files from Amazon S3.
The application is hosted on EC2 and the files are stored on S3.
Need to give users the ability to select from a group of files which will then get bundled up and downloaded to them.
I have heard about a few ActionScript libraries (aszip and fzip) that might make this possible, or this could be done in Ruby, or possibly even PHP.
The files do not need any compression; zip is just being used to bundle them up into one single download.
I use the Nginx Zip Module (mod_zip) to stream local files, but there is an option to stream from remote locations as well. Otherwise you could use it with S3 storage mounted via VFS as a local filesystem.
It supports seeking, i.e. resumable and accelerated downloads.
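For reference, a sketch of the manifest mod_zip expects from the upstream app: nginx fetches each listed location itself and streams the assembled zip to the client, so the app never touches the file bytes. The Rails-style action and proxy paths here are hypothetical; check the module's README for the exact line format:

# Each manifest line is: <crc-32> <size> <location> <name>
# A "-" for the CRC tells mod_zip to skip the CRC check.
def download
  files = [
    ["-", 428,     "/s3_proxy/photos/foo.jpg", "photos/foo.jpg"],
    ["-", 102_400, "/s3_proxy/audio/bar.mp3",  "audio/bar.mp3"],
  ]
  response.headers["X-Archive-Files"] = "zip"
  manifest = files.map { |line| line.join(" ") }.join("\n")
  send_data manifest, type: "text/plain"
end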
If you can use Mono, DotNetZip will do it.
Response.Clear();
Response.BufferOutput = false;  // necessary for chunked output
String ReadmeText = "This content goes into an entry in the " +
                    "zip file. Timestamp, MD5, whatever.";
string archiveName = String.Format("archive-{0}.zip",
                                   DateTime.Now.ToString("yyyy-MMM-dd-HHmmss"));
Response.ContentType = "application/zip";
Response.AddHeader("content-disposition", "filename=" + archiveName);
using (ZipFile zip = new ZipFile())
{
    zip.AddEntry("Readme.txt", "", ReadmeText, Encoding.Default);
    zip.AddFiles(filesToInclude, "files");
    zip.Save(Response.OutputStream);
}
HttpContext.Current.ApplicationInstance.CompleteRequest();
DotNetZip is open source, free to use.
Java supports streaming zips too; take a look at the java.util.zip package. I used it to implement a pipeline consisting of FTP, UNZIP, XSLT, and CSV units. It works like a charm.
Martin