s3cmd sync is remote copying the wrong files to the wrong locations - bash

I've got the following as part of a shell script to copy site files up to a S3 CDN:
for i in "${S3_ASSET_FOLDERS[@]}"; do
    s3cmd sync -c /path/to/.s3cfg --recursive --acl-public --no-check-md5 \
        --guess-mime-type --verbose --exclude-from=sync_ignore.txt \
        /path/to/local/${i} s3://my.cdn/path/to/remote/${i}
done
Say S3_ASSET_FOLDERS is:
("one/" "two/")
and say both of those folders contain a file called... "script.js"
and say I've made a change to two/script.js - but not touched one/script.js
Running the above command will first copy the file from /one/ to the correct location, although I've no idea why it thinks it needs to:
INFO: Sending file '/path/to/local/one/script.js', please wait...
File '/path/to/local/one/script.js' stored as 's3://my.cdn/path/to/remote/one/script.js' (13551 bytes in 0.1 seconds, 168.22 kB/s) [1 of 0]
... and then a remote copy operation for the second folder:
remote copy: two/script.js -> script.js
What's it doing? Why?? Those files aren't even similar. Different modified times, different checksums. No relation.
And I end up with an S3 bucket containing two incorrect files. The file in /two/ that should have been updated hasn't been, and the file in /one/ that shouldn't have changed is now overwritten with the contents of /two/script.js.
Clearly I'm doing something bizarrely stupid, because I don't see anyone else having the same issue. But I've no idea what?

First of all, try running it without the --no-check-md5 option.
Second, I suggest you pay attention to directory names, specifically trailing slashes.
s3cmd documentation says:
With directories there is one thing to watch out for – you can either upload the directory and its contents or just the contents. It all depends on how you specify the source.
To upload a directory and keep its name on the remote side specify the source without the trailing slash
On the other hand, to upload just the contents, specify the directory with a trailing slash
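To see what the trailing slash changes, here is a minimal sketch using the layout from the question (paths are illustrative):

# Without a trailing slash the directory itself is uploaded, keeping its name
s3cmd sync /path/to/local/one s3://my.cdn/path/to/remote/
# -> s3://my.cdn/path/to/remote/one/script.js

# With a trailing slash only the directory's contents are uploaded
s3cmd sync /path/to/local/one/ s3://my.cdn/path/to/remote/one/
# -> s3://my.cdn/path/to/remote/one/script.js

Note that in the loop above ${i} already ends in a slash ("one/"), so the source is treated as contents-only; keeping the slash usage consistent on both sides (or dropping it from the array entries) removes the ambiguity.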

Related

Downloading files as a single .zip on a Windows server

A client has a download area where users can download or browse single files. Files are divided into folders (documents, catalogues, newsletters and so on) and their extensions vary: they can be .pdf, .ai or simple .jpeg. He asked me if I can provide a link to download every item in a specific folder as one big compressed file. The problem is, I'm on a Windows server, so I'm a bit clueless whether there's a way. I can edit the pages of this area, so I can include jQuery and scripts with a little freedom. Any hint?
Windows's built-in command-line archiver is tar, and what you need to build is a tarball (historically, all related files in one Tape ARchive).
I have a file server which is mapped as S:\ (it does not have a tar command itself, and tar cannot use a URL, but it can use a mapped device:).
For any folder's contents (including sub-folders) it is easy to remotely save all current files in a zip with a single command (multiple root locations need a loop or a list).
It will build the tape archive as a Windows .zip using the -a (auto) switch, but you need to consider the desired level of nesting by collecting all contents at the desired root location.
TAR -a[other options] file.zip [folder / files]
Points to watch out for:
ensure there is not an older archive already at the target location
it may print errors/warnings during the run; however, it should complete without failing.
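As a concrete sketch (the S:\ paths are hypothetical, standing in for the mapped drive above):

REM -a picks the format from the .zip extension, -c creates, -f names the
REM output; -C changes into the folder first so the archive holds only its
REM contents rather than the full path
tar -a -cf S:\web\all.zip -C S:\files\catalogues .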
Once you have the zip file you can post it as a web asset, such as:
<a href="\\server\folder\all.zip" download="all.zip">Get All</a>
for other notes see https://stackoverflow.com/a/68728992/10802527

Temp file not being deleted

I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
#Unzip adapter trimmed fastq file
rule unzip_fastq:
    input:
        '{sample}.adapterTrim.round2.fastq.gz',
    output:
        temp('{sample}.adapterTrim.round2.fastq')
    conda:
        '../envs/rep_element.yaml'
    shell:
        'gunzip -c {input[0]} > {output[0]}'

#Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
    input:
        '{sample}.adapterTrim.round2.fastq'
    output:
        'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
    params:
        bt2=config["ref"]["bt2_index_path"],
        eid=config["ref"]["enst2id"]
    conda:
        '../envs/rep_element.yaml'
    shell:
        'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
        '{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally, but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently listed in the rule all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify not to upload this file in the rule, that would be useful.
It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap file names in remote().
However, as a general solution, documentation contains the following:
The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote(), or explicitly wrap the file in local().
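As an illustration of that reading, here is a minimal, untested sketch of the first rule. It assumes the workflow runs with a default remote provider pointing at S3, and that the local() flag composes with temp() (both assumptions, not verified against your setup):

rule unzip_fastq:
    input:
        '{sample}.adapterTrim.round2.fastq.gz'
    output:
        # local() keeps this file out of the default S3 remote provider;
        # temp() then deletes it once downstream rules have consumed it
        temp(local('{sample}.adapterTrim.round2.fastq'))
    shell:
        'gunzip -c {input[0]} > {output[0]}'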

Move files under GCS with renaming

I want to write the following bash script which copies files from one GCS bucket to another with renaming options.
My input folder is gs://test-rtt-integration/result/frd/*.orc
and my destination folder is gs://test-rtt-integration/recent_files/frd
The renaming of the copied file should be done based on the name provided from gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd
once the copy with renaming is done I need to clean gs://test-rtt-integration/result/frd
I tested the following commands, but they are not working properly
NAME="$(gsutil ls gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd)"
gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$NAME
gsutil rm -rf gs://test-rtt-integration/result/frd
(all .orc files and other files should be deleted)
But this is not working properly, as I have to split NAME on / and take the last field. If the result of that split is called SPLIT, I have to do gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$SPLIT
Any idea on how to do this?
The question is a little bit confusing: you say that you want to move files from one Google Cloud Storage bucket to another, but all the operations are made in one single bucket called test-rtt-integration.
However, once you get the file location with the command gsutil ls gs://[BUCKET_NAME]/folder (e.g. gs://[BUCKET_NAME]/folder/[FILENAME].orc), note that the gs://[BUCKET_NAME]/folder/ part is always the same for all the objects in the folder, so just strip it and you will be left with only the object name, [FILENAME].orc.
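Applied to the concrete paths in the question, a minimal sketch of that prefix-stripping could look like this (it assumes a single .orc file in the source and exactly one object in the name-source folder; head -n 1 is a hypothetical tie-breaker otherwise):

# Take the first object listed in the folder that provides the new name
NAME="$(gsutil ls gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd | head -n 1)"
# Strip everything up to the last slash, keeping only the object name
SPLIT="${NAME##*/}"
# Move (copy + delete) with the new name, then clean the source folder
gsutil mv gs://test-rtt-integration/result/frd/*.orc "gs://test-rtt-integration/recent_files/frd/${SPLIT}"
gsutil rm -r gs://test-rtt-integration/result/frd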
I am not sure if this is exactly what you are looking for, but I did a little bit of coding myself and I have created a bash script that:
Gets the name of each object from the gs://[BUCKET_NAME]/from bucket folder
Copies all objects from the gs://[BUCKET_NAME]/from bucket folder to the gs://[BUCKET_NAME]/to/ bucket folder
Deletes all objects from the gs://[BUCKET_NAME]/from bucket folder
Inside there are comments that explain how every operation works in detail. If that is not exactly what you are looking for, you can take the basic idea of how it works and implement it in a different way that suits you better. I have tested the script myself in Google Cloud Shell and it is working. The example code can be found on GitHub.

Update using rsync and remove from the source folder

I want to rsync contents from /local/path to server:/remote/path.
The files end with extensions composed of 4 digits.
If a file does not exist in the remote path, copy the file to remote and remove it from local.
If a file exists in the remote path and its size is no less than the local one, do not copy the file to remote, and remove it from local.
I tried
rsync -avmhP --include='*.[0-9][0-9][0-9][0-9]' --include='*/' --exclude='*' --size-only --remove-source-files /local/path server:/remote/path
However, some files that already exist in the remote path remain in the local path.
Another question is: why do we need --include='*/' --exclude='*'? Why doesn't --include='*.[0-9][0-9][0-9][0-9]' alone work for the file filtering?
Do you mean --remove-sent-files instead of --remove-source-files?
According to the rsync man page:
--remove-sent-files
This tells rsync to remove from the sending side the files and/or symlinks that are newly created or whose content is updated on the receiving side. Directories and devices are not removed, nor are files/symlinks whose attributes are merely changed.
That means that only transferred files (with --size-only, the ones whose size differs) are deleted from the source, which is why files that already exist on the remote stay in your local path. As for the filter question: to make an include pattern take effect, you first have to exclude everything else. The three arguments together mean "exclude all files (--exclude='*') except directories (--include='*/', which rsync needs in order to descend into sub-directories at all) and the files matching my pattern (--include='*.[0-9][0-9][0-9][0-9]')". An --include on its own does nothing, because rsync transfers everything by default; include rules only matter once a later --exclude rule would otherwise filter the files out.
From the man page:
--include=PATTERN
don’t exclude files matching PATTERN
--exclude=PATTERN
exclude files matching PATTERN
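Before trusting the deletion, a dry run shows exactly which files rsync would transfer and remove; this is a sketch using the paths from the question:

# -n (--dry-run) lists what would be transferred (and therefore removed
# from the source) without touching anything; drop it once the selection
# looks right
rsync -avmhn --include='*/' --include='*.[0-9][0-9][0-9][0-9]' --exclude='*' \
    --size-only --remove-source-files /local/path/ server:/remote/path/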

How to restore a folder structure 7Zip'd with split volume option?

I 7Zip'd a multi-gig folder which contained many folders, each with many files, using the split-to-volumes (9 Meg) option. 7Zip created files of type .zip.001, .zip.002, etc. When I extract .001 it appears to work correctly, but I get an 'unexpected end of data' error. 7Zip does not automatically go on to .002. When I extract .002, it gives the same error and does not continue the original folder/file structure; instead it extracts a zip file in the same folder as the previously extracted files. How do I properly extract split files to obtain the original folder/file structure? Thank you.
