copy_stream to download file from remote URL - ruby

We want to download files from a remote URL into memory and then upload them to a public cloud. I am planning to use IO.copy_stream in Ruby. However, I am not sure whether it can do this, because I also need to keep memory and CPU usage low enough that performance does not suffer.
Any suggestions or examples of how to achieve this with IO.copy_stream in Ruby? Or is there another library that would handle this better with performance in mind?
https://ruby-doc.org/core-2.5.5/IO.html

You can set up src/dst to be simple IO abstractions that respond to read/write:
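# srchost and dsthost are placeholders for the SSH host names; set them before running.
src = IO.popen(["ssh", srchost, "cat /path/to/source_file | gzip"], "r")
dst = IO.popen(["ssh", dsthost, "gunzip > /path/to/dest_file"], "w")
# Copies in chunks, so the whole file is never held in memory at once.
IO.copy_stream(src, dst)
src.close
dst.close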
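For the remote-URL case in the question, Net::HTTP yields the response body in chunks rather than exposing an IO object, so one option is to push those chunks through a pipe and let IO.copy_stream drain it. The following is a minimal sketch, not a definitive implementation: the helper names copy_through_pipe and download_to are hypothetical, and dst is assumed to be any object that responds to write (for example, a cloud client's upload stream).

```ruby
require "net/http"
require "uri"

# Run the producer block in a thread that writes into a pipe, while
# IO.copy_stream drains the pipe into dst. Returns the bytes copied.
# (Hypothetical helper, not part of any library.)
def copy_through_pipe(dst, &producer)
  reader, writer = IO.pipe
  reader.binmode
  writer.binmode
  thread = Thread.new do
    begin
      producer.call(writer)
    ensure
      writer.close # signals end-of-stream to the reader
    end
  end
  copied = IO.copy_stream(reader, dst)
  thread.join # re-raises any error raised by the producer
  copied
ensure
  reader&.close
end

# Stream an HTTP(S) response body into dst chunk by chunk.
# (Hypothetical helper; url and dst are supplied by the caller.)
def download_to(url, dst)
  uri = URI(url)
  bytes = nil
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request_get(uri.request_uri) do |response|
      response.value # raises unless the status is 2xx
      bytes = copy_through_pipe(dst) do |sink|
        response.read_body { |chunk| sink.write(chunk) }
      end
    end
  end
  bytes
end
```

Memory use stays bounded by the pipe buffer and the chunk size, regardless of the file size. If dst is not a real IO, IO.copy_stream falls back to calling its write method.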

Set up src to be the downloadable file.
Set up dst to be the cloud resource, with write permission.
Make sure the two are compliant with sendfile().
Sendfile is a kernel-based copy procedure. In terms of RAM use and performance, it is hard to beat, because the data never passes through your application's memory. Your application will not be involved in the transfer.
For sendfile(), the output socket must have zero-copy support and the input file must have mmap() support. In general, this means you have already downloaded the file to a local file, you do not change the downloaded file during the copy, and you have an open socket to the output.
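In Ruby, IO.copy_stream is the usual way to get at this: when the source is a regular file and the destination is a socket, it can use sendfile() on platforms that support it (Linux in particular) and falls back to an ordinary read/write loop otherwise. A minimal sketch, assuming the file is already on local disk and the socket is already connected; the helper name send_file_to_socket is hypothetical:

```ruby
require "socket"

# Send a local file to an already-open socket and return the bytes sent.
# Whether sendfile() is actually used depends on the platform and Ruby build.
def send_file_to_socket(path, socket)
  File.open(path, "rb") do |file|
    IO.copy_stream(file, socket)
  end
end
```

Nothing in the calling code changes between the sendfile() path and the fallback, so the same helper works on every platform, only with different performance.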

Related

How can I use named pipes to stream a GCP Cloud Storage object to an executable that wants input files?

I have a third-party executable that takes a directory path as an argument and in turn looks there for a collection of .db files. I have said collection of files stored in a Google Cloud Storage bucket and would like to stream the content of those files into some local named pipes that can be used as input to the executable.
I'm writing an application to perform the above in Go and am using the "cloud.google.com/go/storage" package to work with cloud storage objects.
As a note, I need all pipes/files to be available for reading at the time I run the executable.
What is the best way to go about this? I'm essentially looking to use the named pipes as a proxy of sorts to make remote files look local to this executable. Is that possible?

Nifi: How to sync two directories in nifi

I have to write my response flowfiles to one directory, then get the data from it, change it, and put it in another directory. I want to keep these two directories in sync (I mean that whenever I delete or change a flowfile in one directory, it should change in the other directory too). I have more than 10000 flowfiles, so a checklist wouldn't be a good solution. Can you recommend:
any controller service that can help me do this?
any better way to do this task without a controller service?
You can use a combination of ListFile, FetchFile, and PutFile processors to detect individual file write changes within a file system directory and copy their contents to another directory. This will not detect file deletions however, so I believe a better solution is to use rsync within an ExecuteProcess processor.
To the best of my knowledge, rsync does not work on HDFS file systems, so in that case I would recommend using a tool like Helix or DistCp (I have not evaluated these tools in particular). You can either invoke them from the command line via ExecuteProcess or wrap a client library in an ExecuteScript or custom processor.

How to write a stream to Google Cloud Storage?

I want to write a file in gcs with a stream object but I've only found the "create_file" function that creates a new file object by providing a path to a local file to upload and the path to store it with in the bucket.
Is there any function to create a file in gcs from a stream?
Fuse over GCS
You could try gcsfuse, which layers a user-space file system over a bucket, but it is only beta software at present. There's a nice section on limitations which you should read first.
I use fog to access GCS but that is a thin layer which doesn't try to impose any additional semantics into the bucket/object model.
Warning, if your problem really requires a standard file-system underneath any possible solution then GCS is not a good fit.
The ability to provide an IO object instead of a File object has only recently been possible. It was added in PR 1335, and will be included in the next release.
Until then, the quickest way is to write the stream to a tempfile and upload that. For more, see Issue 305.
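The tempfile workaround can be sketched as follows. This is an illustration under assumptions, not the library's own method: with_spooled_tempfile is a hypothetical helper, stream is any object that responds to read, and the upload itself is left to the block so the sketch does not depend on a particular client.

```ruby
require "tempfile"

# Spool an IO-like stream to a temporary file, yield the file's path,
# and delete the file afterwards. Returns the block's result.
def with_spooled_tempfile(stream)
  Tempfile.create("upload") do |file|
    file.binmode
    IO.copy_stream(stream, file) # copies in chunks, not all at once
    file.flush                   # make sure the bytes are visible at file.path
    yield file.path
  end
end
```

Inside the block you would call the upload function that takes a local path, such as the create_file function mentioned in the question, with the yielded path and the destination name in the bucket.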

What's the best way to (programmatically) determine a file's network origin?

For an application I'm writing, I want to programmatically find out which computer on the network a file came from. How can I best accomplish this?
Do I need to monitor network transactions or is this data stored somewhere in Windows?
When a file is copied to the local system, Windows does not keep any record of where it was copied from. So unless the application that created it saved such information in the file, it will be lost.
With file auditing, file and directory operations can be tracked, but I don't think that will include the source path for file copies (just who created the file and when).
Yes, it seems like you would either need to detect the file transfer based on interception of network traffic, or if you have the ability to alter the file in some way, use public key cryptography to sign files using a machine-specific key before they are transferred.
Create a service, either on the destination computer or on the file hosting computers, which will add records to an Alternate Data Stream attached to each file, much the way that Windows records zone information (the Zone.Identifier stream) for files downloaded from the internet.
You can have a background process on machine A which "tags" each file as having been tagged by machine A on such-and-such a date and time. Then when machine B downloads the file, assuming we are using NTFS filesystems, it can see the tag from A. Or, if you can't have a process at the server, you can use NTFS streams on the "client" side via packet sniffing methods as others have described. The bonus here is that future file-copies will retain the data as long as it is between NTFS systems.
Alternative: create a requirement that all file transfers must be done through a Web portal (as opposed to network drag-and-drop). Built in logging. Or other type of file retrieval proxy. Do you have control over procedures such as this?

Efficiently creating tar files

Note: I'm using Windows file servers and .NET
If I were to create a TAR file from files on a remote file server (meaning, the TAR file would be created on the remote file server, where the original files are), would the bytes need to come to my machine and then go back to the file server (since my machine is running the code that's generating the TAR), or would they stay on the file server? I'm asking about the best possible (theoretical) implementation.
Thank you!
The bytes need to be where they are processed.
If you process them on your remote system, they must be transferred.
If you process them on your server, they don't need to be transferred.
If your goal is to minimize bandwidth usage, your best bet would be to have a script on your server that will generate the tar files for you when triggered by your remote system.
The best possible implementation really depends on what your goals and constraints are.
The bytes would have to be read into your machine. The only way I know that you can just do the TARing on the remote server is to have the remote server generate the TAR. For example, you could connect via SSH and run a shell command on the remote server.
Unfortunately, in the scenario described, the TAR operation will use network bandwidth. You need to run the tar program on the file server to avoid using bandwidth.
