How to execute a bash file containing curl commands stored in a Google storage bucket and copy the downloaded contents directly to the bucket?

I have a bash script file where each line is a curl command to download a file. This bash file is in a Google bucket.
I would like to either execute the file directly from the bucket and copy the downloaded contents there, or execute it locally and copy its output directly to the bucket.
Basically, I do not want to have these files on my local machine. I have tried things along these lines, but it either failed or simply downloaded everything locally.
gsutil cp gs://bucket/my_file.sh - | bash gs://bucket/folder_to_copy_to/
Thank you!

To do so, the bucket needs to be mounted on the pod (the pod would see it as a directory).
If the bucket supports NFS, you would be able to mount it as shown here.
Also, there is another way as shown in this question.
Otherwise, you would need to copy the script to the pod, run it, upload the generated files to the bucket, and finally clean everything up (see the sketch below).
The better option is to use Filestore, which can easily be mounted using CSI drivers, as mentioned here.
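For the copy-run-upload-clean option, a minimal sketch (assuming the script writes its downloads into the current working directory; the bucket paths are the ones from the question) could look like this:
mkdir work && cd work
gsutil cp gs://bucket/my_file.sh .                  # fetch the script from the bucket
bash my_file.sh && rm my_file.sh                    # run the curl commands, then drop the script itself
gsutil -m cp ./* gs://bucket/folder_to_copy_to/     # upload the downloaded files
cd .. && rm -rf work                                # clean everything up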

Related

How to set a watermark when downloading from an S3 bucket

I have an S3 bucket that contains zip files.
I have a script that downloads the zip files to my local server, unzips them, and uploads them to another S3 bucket.
How can I set a watermark so I know which file I downloaded last, so that I won't need to keep all the files locally or download all the files each time the script runs?
I'm using the aws s3 sync command, which as far as I understand should copy only new files. Am I right?
aws s3 sync $gcs3$gcRegion/$gcTech/$gcPrinterFamily/$gcPrinterType/$gcPrinterName/ $dir
The AWS Command-Line Interface (CLI) aws s3 sync command will copy any files that are not present in the destination (sort of).
So, you either need to keep all previously-downloaded files, or you need another way to keep track of the files that were downloaded.
Instead, I would recommend writing your own program that:
Downloads all files from the S3 bucket with a LastModified timestamp after a stored timestamp
Stores the current time
Unzips the files and copies them to the other S3 bucket
Deletes the zip files and unzipped files
So, the program will need to remember the last time it downloaded files, but it will not need to remember which files it downloaded. Be careful — S3 stores time in UTC, so you'll need to convert your timezones. Or, simply remember the highest LastModified value of the files you downloaded.
To obtain a list of files since a certain LastModified date, you could use the AWS CLI:
aws s3api list-objects --bucket jstack-a --query "Contents[?LastModified>='2019-04-11'].[Key]" --output text
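A rough shell sketch of that watermark approach (the bucket name and the last_run.txt state file are placeholders) might look like this:
mkdir -p ./zips
last=$(cat last_run.txt 2>/dev/null || echo "1970-01-01")    # previously stored watermark
now=$(date -u +"%Y-%m-%dT%H:%M:%S")                          # remember the current time (UTC)
aws s3api list-objects --bucket source-bucket \
  --query "Contents[?LastModified>='$last'].[Key]" --output text |
while read -r key; do
  aws s3 cp "s3://source-bucket/$key" ./zips/                # download only the newer zip files
done
# ... unzip, upload to the destination bucket, then delete the local copies ...
echo "$now" > last_run.txt                                   # store the new watermark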
However, I would recommend writing a Python program for the above activities, since it would be easier than writing command-line scripts.

After copying files using gsutil, they are not deleted instantly from the local storage

My task is to upload CSV files from the local database to the Google Cloud storage.
To do this, I first copy them to my desktop and then upload them to the Google Cloud storage.
I want this to be done automatically, without my participation. Therefore, I created a CMD file that will be run by Task Scheduler. The structure of the CMD file is as follows:
gsutil cp C:\Users\Myname\Desktop\test\*.csv gs://my-bucket
gsutil rm C:\Users\Myname\Desktop\test\*.csv
But after loading the data into Google Cloud Storage, it does not delete the CSV files.
However, if I run the delete as a separate command, it successfully deletes the files.
Just:
gsutil rm C:\Users\Myname\Desktop\test\*.csv
But I want the copy and removal commands to be in one file.
I also tried this way (but it did not help me either):
gsutil cp C:\Users\Myname\Desktop\test\*.csv gs://my-bucket
del C:\Users\Myname\Desktop\test\*.csv
What are the solutions to this problem?
The gsutil mv command is designed for this use case.
Note, however, the docs section about atomicity. Especially with moving from your local filesystem to the cloud, there is no way to upload and delete atomically, so the command will first upload, verify the file is stored in the cloud, and then delete the local file.
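Using the paths from the question, that would be something like:
gsutil mv C:\Users\Myname\Desktop\test\*.csv gs://my-bucket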
The problem is caused by gsutil being a script. On Windows, this script (gsutil) exits and stops further processing of commands in your batch file.
The solution is to add the word call in front of gsutil:
call gsutil cp C:\Users\Myname\Desktop\test\*.csv gs://my-bucket
Next, do not use gsutil to delete a local file. Use del instead.
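Putting both fixes together, the CMD file from the question would become something like:
call gsutil cp C:\Users\Myname\Desktop\test\*.csv gs://my-bucket
del C:\Users\Myname\Desktop\test\*.csv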

In what folder/directory does Minio store its objects/files on local disk?

I'm running a Minio instance on my local machine, on http://localhost:9000. I created a bucket called foo, and uploaded a single file test.txt to it.
Since this Minio instance is running on my local machine, where is the "uploaded" binary test.txt file actually stored in my local filesystem? That is, is it possible to locate this test.txt file in a folder somewhere using the GUI filesystem browser? For example, /Users/myusername/some/hidden/folder/minio/buckets/foo/test.txt
I looked through the docs and they only have information about using the mc command via mc ls to view the files in the "remote" Minio machine via the command line. This is not what I'm looking for; I want to see in which folder the test.txt file actually resides on my local machine. Is this possible, or am I forced to only view an abstraction of test.txt through the mc command?
The file is stored under the path you pass when starting the MinIO server.
For example, if you start MinIO with minio server /tmp, create a bucket foo, and upload an object bar.txt, the file will be present at /tmp/foo/bar.txt.
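A small illustration of that flow (assuming an mc alias named local pointing at http://localhost:9000; the access and secret keys are placeholders):
minio server /tmp                                    # run in a separate terminal; /tmp becomes the data directory
mc alias set local http://localhost:9000 <access-key> <secret-key>
mc mb local/foo                                      # create the bucket
mc cp bar.txt local/foo/                             # upload the object
ls /tmp/foo/                                         # the object data now appears under the data directory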

How to download a file from a webpage directly into HDFS without using the local file system?

I need some help. I am downloading a file from a webpage using Python code, placing it in the local file system, transferring it into HDFS using the put command, and then performing operations on it.
But there might be situations where the file is very large and downloading it to the local file system is not the right approach. So I want the file to be downloaded directly into HDFS without using the local file system at all.
Can anyone suggest which method would be the best way to proceed?
If there are any errors in my question, please correct me.
You can pipe it directly from a download to avoid writing it to disk, e.g.:
curl server.com/my/file | hdfs dfs -put - destination/file
The - parameter to -put tells it to read from stdin (see the documentation).
This will still route the download through your local machine, though, just not through your local file system. If you want to download the file without using your local machine at all, you can write a map-only MapReduce job whose tasks accept e.g. an input file containing a list of files to be downloaded and then download them and stream out the results. Note that this will require your cluster to have open access to the internet which is generally not desirable.
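For a handful of files, this can stay a simple shell loop rather than a MapReduce job (urls.txt is a hypothetical file with one URL per line):
while read -r url; do
  name=$(basename "$url")                               # derive the HDFS file name from the URL
  curl -sS "$url" | hdfs dfs -put - "destination/$name" # stream the download straight into HDFS
done < urls.txt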

aws s3 cp to a local file was not replacing file

I have a shell script that is running aws s3 cp s3://s3file /home/usr/localfile. The file already exists in that directory, so the cp command is essentially getting the latest copy from S3 to get the latest version.
However, I noticed today that the file was not the latest version; it didn't match the file on S3. Looking at the shell script's stdout from the last two runs, it looks like the command ran - the output is: download: s3://s3file to usr/localfile. But when I compared the copies, they didn't match. The modification timestamp on the local file, viewed via WinSCP (a file transfer client), didn't change either.
I manually ran the command in a shell just now and it copied the file from S3 to the local machine and successfully got the latest copy.
Do I need to add a specific option for this, or is it typical behavior for aws s3 cp not to overwrite an existing file?
