Temp file not being deleted - bioinformatics

I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
#Unzip adapter trimmed fastq file
rule unzip_fastq:
input:
'{sample}.adapterTrim.round2.fastq.gz',
output:
temp('{sample}.adapterTrim.round2.fastq')
conda:
'../envs/rep_element.yaml'
shell:
'gunzip -c {input[0]} > {output[0]}'
#Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
input:
'{sample}.adapterTrim.round2.fastq'
output:
'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
params:
bt2=config["ref"]["bt2_index_path"], eid=config["ref"]["enst2id"]
conda:
'../envs/rep_element.yaml'
shell:
'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
'{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally, but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently written in the rule-all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify not to upload this file in the rule, that would be useful.

It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap file names in remote.
However, as a general solution, documentation contains the following:
The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote, or explicitly wrap the file in local.

Related

dockerfile copy list of files, when list is taken from a local file

I've got a file containing a list of paths that I need to copy by Dockerfile's COPY command on docker build.
My use case is such: I've got a python requirements.txt file, when inside I'm calling multiple other requirements files inside the project, with -r PATH.
Now, I want to docker COPY all the requirements files alone, run pip install, and then copy the rest of the project (for cache and such). So far i haven't managed to do so with docker COPY command.
No need of help on fetching the paths from the file - I've managed that - just if it is possible to be done - how?
thanks!
Not possible in the sense that the COPY directive allows it out of the box, however if you know the extensions you can use a wildcard for the path such as COPY folder*something*name somewhere/.
For simple requirements.txt fetching that could be:
# but you need to distinguish it somehow
# otherwise it'll overwrite the files and keep the last one
# e.g. rename package/requirements.txt to package-requirements.txt
# and it won't be an issue
COPY */requirements.txt ./
RUN for item in $(ls requirement*);do pip install -r $item;done
But if it gets a bit more complex (as in collecting only specific files, by some custom pattern etc), then, no. However for that case simply use templating either by a simple F-string, format() function or switch to Jinja, create a Dockerfile.tmpl (or whatever you'd want to name a temporary file), then collect the paths, insert into the templated Dockerfile and once ready dump to a file and execute afterwards with docker build.
Example:
# Dockerfile.tmpl
FROM alpine
{{replace}}
# organize files into coherent structures so you don't have too many COPY directives
files = {
"pattern1": [...],
"pattern2": [...],
...
}
with open("Dockerfile.tmpl", "r") as file:
text = file.read()
insert = "\n".join([
f"COPY {' '.join(values)} destination/{key}/"
for key, values in files.items()
])
with open("Dockerfile", "w") as file:
file.write(text.replace("{{replace}}", insert))
You might want to do this for example:
FROM ...
ARG files
COPY files
and run with
docker build -build-args items=`${cat list_of_files_to_copy.txt}`

Move files under GCS with renaming

I want to write the following bash script which copies files from one GCS bucket to another with renaming options.
My input folder is gs://test-rtt-integration/result/frd/*.orc
and my destination folder is gs://test-rtt-integration/recent_files/frd
The renaming of the copied file should be done based on the name provided from gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd
once the copy with renaming is done I need to clean gs://test-rtt-integration/result/frd
I tested the following commands, but they are not working properly
NAME = "$(gsutil ls gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd)"
gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$NAME
gsutil rm -rf gs://test-rtt-integration/result/frd
( all .orc files and other files should be deleted)
But this is not working properly as I have to split the NAME based on / and get the last split , so if the result of split is called SPLIT , I have to do gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$SPLIT
Any idea on how to do this?
The question is a little bit confusing. You say that you want to move files from one Google Cloud Storage bucket to another, but all the operations are made in one single bucket called test-rtt-integration.
However, as soon as you get the file location with the command gsutil ls gs://[BUCKET_NAME]/folder e.g. gs://[BUCKET_NAME]/folder/[FILENAME].orc, since the gs://[BUCKET_NAME]/folder/ part is always the same for all the objects in the folder, just replace it with null and you will get only the object name at the end as [FILENAME].orc etc.
I am not sure if this is exactly what you are looking for, but I did a little bit of coding myself and I have created a bash script that:
Gets the name of each object from gs://[BUCKET_NAME]/from bucket folder
Copy all objects from gs://[BUCKET_NAME]/from bucket folder to the gs://[BUCKET_NAME]/to/ bucket folder
Delete all objects from gs://[BUCKET_NAME]/from bucket folder
Inside there are comments that explain how every operation works in details. If that is not exactly what you are looking for, you can get the basic idea of how that works and implement it in different way that will suit you better. I have tested the scrip myself in Google Cloud Shell and it is working. The example code can be found in GitHub.

sql loader without .dat extension

Oracle's sqlldr defaults to a .dat extension. That I want to override. I don't like to rename the file. When googled get to know few answers to use . like data='fileName.' which is not working. Share your ideas, please.
Error message is fileName.dat is not found.
Sqlloder has default extension for all input files data,log,control...
data= .dat
log= .log
control = .ctl
bad =.bad
PARFILE = .par
But you have to pass filename without apostrophe and dot
sqlloder pass/user#db control=control data=data
sqloader will add extension. control.ctl data.dat
Nevertheless i do not understand why you do not want to specify extension?
You can't, at least in Unix/Linux environments. In Windows you can use the trailing period trick, specifying either INFILE 'filename.' in the control file or DATA=filename. on the command line. WIndows file name handling allows that; you can for instance do DIR filename. at a command prompt and it will list the file with no extension (as will DIR filename). But you can't do that with *nix, from a shell prompt or anywhere else.
You said you don't want to copy or rename the file. Temporarily renaming it might be the simplest solution, but as you may have a reason not to do that even briefly you could instead create a hard or soft link to the file which does have an extension, and use that link as the target instead. You could wrap that in a shell script that takes the file name argument:
# set variable from correct positional parameter; if you pass in the control
# file name or other options, this might not be $1 so adjust as needed
# if the tmeproary file won't be int he same directory, need to be full path
filename=$1
# optionally check file exists, is readable, etc. but overkill for demo
# can also check temporary file does not already exist - stop or remove
# create soft link somewhere it won't impact any other processes
ln -s ${filename} /tmp/${filename##*/}.dat
# run SQL*Loader with soft link as target
sqlldr user/password#db control=file.ctl data=/tmp/${filename##*/}.dat
# clean up
rm -f /tmp/${filename##*/}.dat
You can then call that as:
./scriptfile.sh /path/to/filename
If you can create the link in the same directory then you only need to pass the file, but if it's somewhere else - which may be necessary depending on why renaming isn't an option, and desirable either way - then you need to pass the full path of the data file so the link works. (If the temporary file will be int he same filesystem you could use a hard link, and you wouldn't have to pass the full path then either, but it's still cleaner to do so).
As you haven't shown your current command line options you may have to adjust that to take into account anything else you currently specify there rather than in the control file, particularly which positional argument is actually the data file path.
I have the same issue. I get a monthly download of reference data used in medical application and the 485 downloaded files don't have file extensions (#2gb). Unless I can load without file extensions I have to copy the files with .dat and load from there.

wanted to use results of find command in custom script that i am building

I want to validate my XML's for well-formed ness, but some of my files are not having a single root (which is fine as per business req eg. <ri>...</ri><ri>..</ri> is valid xml in my context) , xmlwf can do this, but it flags out a file if it's not having single root, So wanted to build a custom script which internally uses xmlwf, my custom script should do below,
iterate through list of files passed as input (eg. sample.xml or s*.xml or *.xml)
for each file prepare a temporary file as <A>+contents of file+</A>
and call xmlwf on that temp file,
Can some one help on this?
You could add text to the beginning and end of the file using cat and bash, so that your file has a root added to it for validation purposes.
cat <(echo '<root>') sample.xml <(echo '</root>') | xmlwf
This way you don't need to write temporary files out.

s3cmd sync is remote copying the wrong files to the wrong locations

I've got the following as part of a shell script to copy site files up to a S3 CDN:
for i in "${S3_ASSET_FOLDERS[#]}"; do
s3cmd sync -c /path/to/.s3cfg --recursive --acl-public --no-check-md5 --guess-mime-type --verbose --exclude-from=sync_ignore.txt /path/to/local/${i} s3://my.cdn/path/to/remote/${i}
done
Say S3_ASSET_FOLDERS is:
("one/" "two/")
and say both of those folders contain a file called... "script.js"
and say I've made a change to two/script.js - but not touched one/script.js
running the above command will firstly copy the file from /one/ to the correct location, although I've no idea why it thinks it needs to:
INFO: Sending file
'/path/to/local/one/script.js', please wait...
File
'/path/to/local/one/script.js'
stored as
's3://my.cdn/path/to/remote/one/script.js' (13551
bytes in 0.1 seconds, 168.22 kB/s) [1 of 0]
... and then a remote copy operation for the second folder:
remote copy: two/script.js -> script.js
What's it doing? Why?? Those files aren't even similar. Different modified times, different checksums. No relation.
And I end up with an s3 bucket with two incorrect files in. The file in /two/ that should have been updated, hasn't. And the file in /one/ that shouldn't have changed is now overwritten with the contents of /two/script.js
Clearly I'm doing something bizarrely stupid because I don't see anyone else having the same issue. But I've no idea what??
First of all, try to run it without --no-check-md5 option.
Second, I suggest you to pay attention to directory names, specifically trailing slashes.
s3cmd documentation says:
With directories there is one thing to watch out for – you can either upload the directory and its contents or just the contents. It all depends on how you specify the source.
To upload a directory and keep its name on the remote side specify the source without the trailing slash
On the other hand to upload just the contents, specify the directory it with a trailing slash

Resources