Move files under GCS with renaming - bash

I want to write a bash script that copies files from one GCS bucket to another, with renaming options.
My input folder is gs://test-rtt-integration/result/frd/*.orc
and my destination folder is gs://test-rtt-integration/recent_files/frd
The copied file should be renamed based on the name provided in gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd
Once the copy with renaming is done, I need to clean gs://test-rtt-integration/result/frd
I tested the following commands, but they are not working properly
NAME = "$(gsutil ls gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd)"
gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$NAME
gsutil rm -rf gs://test-rtt-integration/result/frd
(all .orc files and other files should be deleted)
But this is not working properly, as I have to split NAME on / and take the last piece. If the result of that split is called SPLIT, I have to do gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$SPLIT
Any idea on how to do this?
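In bash, the "take the last split" step described above is usually done with parameter expansion or basename; a minimal sketch, assuming the gsutil ls call returns exactly one object URL:

# Assumption: the listing returns exactly one object URL
NAME="$(gsutil ls gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd)"

# Keep only the part after the last "/" (the object name itself)
SPLIT="${NAME##*/}"        # equivalent to: SPLIT="$(basename "$NAME")"

gsutil mv gs://test-rtt-integration/result/frd/*.orc "gs://test-rtt-integration/recent_files/frd/$SPLIT"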

The question is a little bit confusing. You say that you want to move files from one Google Cloud Storage bucket to another, but all the operations happen within a single bucket called test-rtt-integration.
However, once you get a file's location with the command gsutil ls gs://[BUCKET_NAME]/folder, e.g. gs://[BUCKET_NAME]/folder/[FILENAME].orc, the gs://[BUCKET_NAME]/folder/ part is the same for all the objects in the folder, so you can simply replace it with an empty string and you are left with only the object name, e.g. [FILENAME].orc.
I am not sure if this is exactly what you are looking for, but I did a little bit of coding myself and created a bash script that:
Gets the name of each object in the gs://[BUCKET_NAME]/from/ bucket folder
Copies all objects from the gs://[BUCKET_NAME]/from/ bucket folder to the gs://[BUCKET_NAME]/to/ bucket folder
Deletes all objects from the gs://[BUCKET_NAME]/from/ bucket folder
Inside there are comments that explain how every operation works in detail. If that is not exactly what you are looking for, you can still take the basic idea of how it works and implement it in a different way that suits you better. I have tested the script myself in Google Cloud Shell and it is working. The example code can be found on GitHub.
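A rough sketch of that kind of script (not the GitHub version, just an illustration with placeholder bucket/folder names, assuming object names contain no spaces):

#!/bin/bash
# Illustration only: placeholder names, object names assumed to contain no spaces.
FROM="gs://[BUCKET_NAME]/from"
TO="gs://[BUCKET_NAME]/to"

# List the full URL of every object under the source folder
for OBJECT in $(gsutil ls "$FROM/*"); do
    # Strip everything up to the last "/" to keep only the object name
    FILENAME="$(basename "$OBJECT")"
    # Copy the object into the destination folder (rename here if needed)
    gsutil cp "$OBJECT" "$TO/$FILENAME"
done

# Clean up: delete everything under the source folder
gsutil -m rm -r "$FROM"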

Related

Temp file not being deleted

I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
# Unzip adapter trimmed fastq file
rule unzip_fastq:
    input:
        '{sample}.adapterTrim.round2.fastq.gz'
    output:
        temp('{sample}.adapterTrim.round2.fastq')
    conda:
        '../envs/rep_element.yaml'
    shell:
        'gunzip -c {input[0]} > {output[0]}'

# Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
    input:
        '{sample}.adapterTrim.round2.fastq'
    output:
        'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
    params:
        bt2=config["ref"]["bt2_index_path"], eid=config["ref"]["enst2id"]
    conda:
        '../envs/rep_element.yaml'
    shell:
        'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
        '{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally, but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently written in the rule all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify not to upload this file in the rule, that would be useful.
It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap the file names in remote().
However, as a general solution, documentation contains the following:
The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote, or explicitly wrap the file in local.

Bash get all specific files in specific directory

I have a script that takes as an argument a path to a file upon which it performs certain operations. These files are stored in directories with paths of the form storage/year/month/day_id/files (so on 2016 July 22 it would be storage/2016/Jul/22_1/files for the first set of files, .../Jul/22_2/files for the second one, etc.). The problem is that each directory stores files with two extensions (say file.doc and file.txt) and I want to perform operations only on .txt files. I had earlier tested something like
for file in "/home/gonczor/temp/"*/*".txt"; do
echo "$file"
done
And it worked perfectly, given that the directory names don't change. But when I move one step further and add these 22_1, 22_2, 23_1 directories, something strange happens.
This is my script (simplified):
for file in "$FILE_PATH/""$YEAR/""$MONTH/""$DAY"*/*".txt"; do
my_program ${report}
done
And instead of finding .../2016/Jul/22_1/file.txt it finds /2016/Jul/22*/*.txt
How can I make it work? The solution I've tried to make up is from here
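For reference, a hedged sketch of one way to write that loop, assuming the real problem is that the pattern matches nothing (so bash passes the literal pattern through) and that the loop variable, rather than ${report}, is what should be handed to the program:

#!/bin/bash
# Sketch only: assumes FILE_PATH, YEAR, MONTH and DAY are already set correctly.
shopt -s nullglob    # an unmatched glob expands to nothing instead of the literal pattern

for file in "$FILE_PATH/$YEAR/$MONTH/${DAY}"*/*.txt; do
    my_program "$file"    # use the loop variable, not ${report}
done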

Extracting contents of many zipped folders into a single directory

Kind of easy question, but I can't find the answer. I want to extract the contents of multiple zipped folders into a single directory. I am using the bash console, which is the only tool available on the particular website I am using.
For example, I have two folders: a.zip (which contains a1.txt and a2.txt) and b.zip (which contains b1.txt and b2.txt). I want to extract all four text files into a single directory.
I have tried
unzip \*.zip -d \newdirectory
But it creates two directories (a and b) with two text files in each.
I also tried concatenating the two zipped folders into one big folder and extracting it, but it still creates two directories, even when I specify a new directory.
I can't figure out what I am doing wrong. Any help?
Thanks in advance!
Use the -j parameter to ignore any directory structure.
unzip -j -d /path/to/your/directory '*.zip*'
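For the a.zip / b.zip example from the question, that would look roughly like this (newdirectory is simply the target directory name from the question):

mkdir -p newdirectory
# The quotes stop the shell from expanding the wildcard; unzip expands it itself
unzip -j '*.zip' -d newdirectory
ls newdirectory    # expected: a1.txt  a2.txt  b1.txt  b2.txt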

How to restore a folder structure 7Zip'd with split volume option?

I 7Zip'd a multi-gig folder which contained many folders, each with many files, using the split to volumes (9 MB) option. 7Zip created files of type .zip.001, .zip.002, etc. When I extract .001 it appears to work correctly, but I get an 'unexpected end of data' error. 7Zip does not automatically go on to .002. When I extract .002, it gives the same error and does not continue the original folder/file structure; instead it extracts a zip file into the same folder as the previously extracted files. How do I properly extract split files to obtain the original folder/file structure? Thank you.
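A hedged sketch of the usual approach to 7-Zip split archives: keep all volumes in the same directory and extract from the first volume only, which lets 7-Zip pick up the remaining parts automatically (the archive name below is made up):

# All volumes (backup.zip.001, backup.zip.002, ...) must sit in the same directory.
# Pointing 7-Zip at the first volume makes it read the remaining parts itself.
7z x backup.zip.001

# Alternative: the volumes are plain byte-splits of one .zip, so they can be
# concatenated back into a single archive and extracted with a regular unzip.
cat backup.zip.* > backup.zip
unzip backup.zip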

s3cmd sync is remote copying the wrong files to the wrong locations

I've got the following as part of a shell script to copy site files up to a S3 CDN:
for i in "${S3_ASSET_FOLDERS[#]}"; do
s3cmd sync -c /path/to/.s3cfg --recursive --acl-public --no-check-md5 --guess-mime-type --verbose --exclude-from=sync_ignore.txt /path/to/local/${i} s3://my.cdn/path/to/remote/${i}
done
Say S3_ASSET_FOLDERS is:
("one/" "two/")
and say both of those folders contain a file called... "script.js"
and say I've made a change to two/script.js - but not touched one/script.js
Running the above command will firstly copy the file from /one/ to the correct location, although I've no idea why it thinks it needs to:
INFO: Sending file '/path/to/local/one/script.js', please wait...
File '/path/to/local/one/script.js' stored as 's3://my.cdn/path/to/remote/one/script.js' (13551 bytes in 0.1 seconds, 168.22 kB/s) [1 of 0]
... and then a remote copy operation for the second folder:
remote copy: two/script.js -> script.js
What's it doing? Why?? Those files aren't even similar. Different modified times, different checksums. No relation.
And I end up with an s3 bucket with two incorrect files in. The file in /two/ that should have been updated, hasn't. And the file in /one/ that shouldn't have changed is now overwritten with the contents of /two/script.js
Clearly I'm doing something bizarrely stupid because I don't see anyone else having the same issue. But I've no idea what??
First of all, try to run it without the --no-check-md5 option.
Second, I suggest you pay attention to the directory names, specifically the trailing slashes.
s3cmd documentation says:
With directories there is one thing to watch out for – you can either upload the directory and its contents or just the contents. It all depends on how you specify the source.
To upload a directory and keep its name on the remote side specify the source without the trailing slash
On the other hand to upload just the contents, specify the directory it with a trailing slash
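A small illustration of that trailing-slash difference, using made-up local and remote paths:

# Without a trailing slash: uploads the directory itself,
# so files end up under s3://my.cdn/path/to/remote/one/...
s3cmd sync /path/to/local/one s3://my.cdn/path/to/remote/

# With a trailing slash: uploads only the contents,
# so files end up directly under s3://my.cdn/path/to/remote/
s3cmd sync /path/to/local/one/ s3://my.cdn/path/to/remote/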
