Snakemake - rule that downloads data

I am having some trouble implementing a pipeline in which the first step is downloading the data from some server. As far as I understand, all rules must have inputs which are files. However, in my case the "input" is an ID string given to a script which accesses the server and downloads the data.
I am aware of the remote files option in snakemake, but the server I am downloading from (ENA) is not on that list. Moreover, I am using a script which calls aspera in order to improve download speed.
Any ideas of how such a scenario can be implemented in snakemake?

Rules do not actually need an input field, so something like this is possible:
rule download:
    output:
        "downloads/{sample}.fa"
    shell:
        # placeholder command; the real call must write the file named in output
        "ascp ftp:/url_here+{wildcards.sample}"

Related

Get files from Ab-initio server to SFTP server

I need a shell script to pull .dat files from the source server to an SFTP server.
Every time the job runs, the shell script has to check whether the table already exists on the SFTP server and fetch all files for that table whose date is later than that of the existing file (the comparison has to be based on the date in the filename).
Example: yesterday the job ran and the file "table1_extract_20190101.dat" was extracted. The source server now has two files, "table1_extract_20190102.dat" and "table1_extract_20190103.dat". The script then has to fetch both files, and so on for every table.
Please suggest how this could be implemented.
Thanks
Use the Ab Initio SFTP To component.
Ideally, add it at the end of the graph that creates the files, so all handling is in one place. The SFTP To component(s) would run in a new phase after the files are written.
Or, create another Ab Initio graph that looks for filenames based on the filename specification used to generate the original filenames. One risk is being sure the files have been written completely, which is why it is ideal to do it in the original graph. You would need to schedule this graph to run after the first graph is complete. A good way to do that is with a plan. Another way using Control>Center is to schedule this job after the previous one completes by adding a job dependency.
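The date comparison described in the question can be sketched independently of Ab Initio. The following is only an illustration in Python, not an Ab Initio feature: the function names are hypothetical, the filename pattern is inferred from the example "table1_extract_20190101.dat", and it assumes that a table with no file on the target yet should have all of its files fetched.

```python
import re
from datetime import datetime

# Assumed filename pattern: <table>_extract_<YYYYMMDD>.dat, as in the question's example.
FILENAME_RE = re.compile(r"^(?P<table>.+)_extract_(?P<date>\d{8})\.dat$")


def parse_name(filename):
    """Return (table, date) for a matching filename, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    try:
        file_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None
    return match.group("table"), file_date


def files_to_fetch(source_names, target_names):
    """List source files dated after the newest file already on the target, per table."""
    latest = {}
    for name in target_names:
        parsed = parse_name(name)
        if parsed:
            table, file_date = parsed
            if table not in latest or file_date > latest[table]:
                latest[table] = file_date

    wanted = []
    for name in sorted(source_names):
        parsed = parse_name(name)
        if parsed:
            table, file_date = parsed
            # Assumption: a table unknown to the target gets all of its files.
            if table not in latest or file_date > latest[table]:
                wanted.append(name)
    return wanted
```

The two name lists would come from directory listings of the source and the SFTP target; how those listings are obtained is left open here.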

Bosun adding external collectors

What is the procedure for defining new external collectors in Bosun using scollector?
Can we write python or shell scripts to collect data?
The documentation around this is not quite up to date. You can do it as described in http://godoc.org/bosun.org/cmd/scollector#hdr-External_Collectors , but we also support JSON output which is better.
Either way, you write something and put it in the external collectors directory, followed by a frequency directory, and then an executable script or binary. Something like:
<external_collectors_dir>/<freq_sec>/foo.sh.
If the frequency directory is 0, then the script is expected to run continuously, and you put a sleep inside the code (this is my preferred method for external collectors). The script outputs the telnet format, or the undocumented JSON format, to stdout. Scollector picks it up and queues that information for sending.
I created an issue to get this documented not long ago https://github.com/bosun-monitor/bosun/issues/1225. Until one of us gets around to that, here is the PR that added JSON https://github.com/bosun-monitor/bosun/commit/fced1642fd260bf6afa8cba169d84c60f2e23e92
Adding to what Kyle said, you can take a look at some existing external collectors to see what they output. Here is one written in Java that one of our colleagues wrote to monitor JVM stuff. It uses the text format, which is simply:
metricname timestamp value tag1=foo tag2=bar
If you want to use the JSON format, here is an example from one of our collectors:
{"metric":"exceptional.exceptions.count","timestamp":1438788720,"value":0,"tags":{"application":"AdServer","machine":"ny-web03","source":"NY_Status"}}
And you can also send metadata:
{"Metric":"exceptional.exceptions.count","Name":"rate","Value":"counter"}
{"Metric":"exceptional.exceptions.count","Name":"unit","Value":"errors"}
{"Metric":"exceptional.exceptions.count","Name":"desc","Value":"The number of exceptions thrown per second by applications and machines. Data is queried from multiple sources. See status instances for details on exceptions."}
Or send error messages to stderr:
2015/08/05 15:32:00 lookup OR-SQL03: no such host
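To answer the Python part of the question concretely, here is a minimal sketch of helpers that build lines in the two data formats shown above. The function names are hypothetical, and the JSON keys are copied from the example in this answer rather than from official documentation, so treat them as an assumption.

```python
import json
import sys
import time


def text_line(metric, value, tags, timestamp=None):
    """Build one line in the text format: metricname timestamp value tag1=foo tag2=bar."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    tag_str = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {ts} {value} {tag_str}".rstrip()


def json_line(metric, value, tags, timestamp=None):
    """Build one line in the JSON format, using the keys from the example above."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    record = {"metric": metric, "timestamp": ts, "value": value, "tags": tags}
    return json.dumps(record, separators=(",", ":"))


def emit(line):
    """Write one data point to stdout and flush so it is not held in a buffer."""
    sys.stdout.write(line + "\n")
    sys.stdout.flush()
```

A script placed in a frequency directory would call emit() once per metric and exit; one placed in the 0 directory would do the same inside a loop with a time.sleep() between rounds.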

determine if file is complete

I am trying to write a Ruby video transcoding script (using ffmpeg) that depends on mov files being FTPed to a server.
The problem I've run into is that when a large file is uploaded by a user, the watch script (using rb-inotify) attempts to execute (and run the transcoder) before the mov is completely uploaded.
I'm a complete noob. But I'm trying to discover if there is a way for me to be able to ensure my watch script doesn't run until the file(s) is/are completely uploaded.
My watch script is here:
require 'rb-inotify'

watch_me = INotify::Notifier.new
# The block runs each time a file opened for writing in the directory is closed.
watch_me.watch("/directory_to_my/videos", :close_write) do |event|
  load '/directory_to_my/videos/.transcoder.rb'
end
watch_me.run
Thank you for any help you can provide.
Just relying on inotify(7) to tell you when a file has been updated isn't a great fit for telling when an upload is 'complete' -- an FTP session might time out and be re-started, for example, allowing a user to upload a file in chunks over several days as connectivity is cheap or reliable or available. inotify(7) only ever sees file open, close, rename, and access, but never the higher-level event "I'm done modifying this file", as the user would understand it.
There are two mechanisms I can think of: one is to have uploads go initially into one directory and ask the user to move the file into another directory when the upload is complete. The other creates some file meta-data on the client and uses that to "know" when the upload is complete.
Move completed files manually
If your users upload into the directory ftp/incoming/temporary/, they can upload the file in as many connections as are required. Once the file is "complete", they can rename the file (rename ftp/incoming/temporary/hello.mov ftp/incoming/complete/hello.mov) and your rb-inotify interface looks for file renames in the ftp/incoming/complete/ directory, and starts the ffmpeg(1) command.
Generate metadata
For a transfer to be "complete", you're really looking for two things:
The file is the same size on both systems.
The file is identical on both systems.
Since "identical" is otherwise difficult to check, most people content themselves with checking whether the contents of the file, when run through a cryptographic hash function such as MD5 or SHA-1 (or better, SHA-224, SHA-256, SHA-384, or SHA-512), produce the same hash on both systems. MD5 is quite fine if you're guarding against incomplete transmission, but if you intend to use the output of the function for other purposes, using a stronger function would be wise.
MD5 is really tempting though, since tools to create and validate MD5 hashes are very widespread: md5sum(1) on most Linux systems, md5(1) on most BSD systems (including OS X).
$ md5sum /etc/passwd
c271aa0e11f560af419557ef49a27ac8 /etc/passwd
$ md5sum /etc/passwd > /tmp/sums
$ md5sum -c /tmp/sums
/etc/passwd: OK
The md5sum -c command asks the md5sum(1) program to check the file of hashes and filenames for correctness. It looks a little silly when used on just a single file, but when you've got dozens or hundreds of files, it's nice to let the software do the checking for you. For example: http://releases.mozilla.org/pub/mozilla.org/firefox/releases/3.0.19-real-real/MD5SUMS -- Mozilla has published such files with 860 entries -- checking them by hand would get tiring.
Because checking hashes can take a long time (five minutes on my system to check a high-definition hour-long video that wasn't recently used), it'd be a good idea to only check the hashes when the filesizes match. Modify your upload tool to send along some metadata about how long the file is and what its cryptographic hash is. When your rb-inotify script sees file close requests, check the file size, and if the sizes match, check the cryptographic hash. If the hashes match, then start your ffmpeg(1) command.
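The size-then-hash check described above can be sketched as follows. This is an illustration in Python rather than Ruby; the function names are hypothetical, and it assumes the uploader has already delivered the expected size and MD5 hash by some side channel that is not shown.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a large video is never read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_is_complete(path, expected_size, expected_md5):
    """Cheap size check first; run the slow hash only when the sizes match."""
    if os.path.getsize(path) != expected_size:
        return False
    return file_md5(path) == expected_md5.lower()
```

Only when upload_is_complete() returns True would the watcher go on to start the transcoder.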
It seems easier to upload the file to a temporary directory on the server and move it to the location your script is watching once the transfer is completed.

Verify whether ftp is complete or not?

I have an application that polls a folder continuously. Once a file has been transferred to the folder by FTP, the application has to move it to another folder for processing.
Here, we don't have any way to verify whether the FTP transfer is complete or not.
One command, "lsof", is suggested in the technical forums. It has a file descriptor (FD) column which gives the file status.
Since this is a FreeBSD command and is not present in old versions of Linux, I would like to clarify the usage of this command.
Can you guys tell us your experience in file verification and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old and doesn't have a corresponding sentinel file, get rid of it - it was a failed transmission.
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
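Method one can be sketched as below. This is only a minimal illustration: the ".sentinel" suffix follows the example above, while the function names are hypothetical and the cleanup of stale data files without a sentinel is left out.

```python
import os

SENTINEL_SUFFIX = ".sentinel"  # suffix taken from the "contracts.doc.sentinel" example


def ready_files(directory):
    """Return data files whose sentinel has arrived, i.e. whose transfer has finished."""
    names = set(os.listdir(directory))
    ready = []
    for name in sorted(names):
        if name.endswith(SENTINEL_SUFFIX):
            data_name = name[: -len(SENTINEL_SUFFIX)]
            # A sentinel without its data file is ignored.
            if data_name in names:
                ready.append(os.path.join(directory, data_name))
    return ready


def finish(data_path):
    """Delete the sentinel and the data file once processing has succeeded."""
    os.remove(data_path + SENTINEL_SUFFIX)
    os.remove(data_path)
```

The listener would call ready_files() on each polling round, process every path it returns, and then call finish() on it.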
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness. e.g. does the file start with a length field, or is it obviously incomplete ? e.g. parsing an incomplete XML file will result in a parse error due to the lack of an end element. Depending on your file's size and format, this can be trivial, or can be very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use this, note the -F option, which formats the output suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check if the FTP session is still active. This will work if every peer has its own FTP user account. As long as the user is not logged off from FTP, assume the files are not complete.

Is file still being uploaded?

I have an app that I'm writing that takes files in a specific directory that have been uploaded via SFTP and moves them to S3.
I have a problem where my cron job starts uploading a file when it's not completely uploaded. I have thought of every way to try and wait until the file is complete, but I have no way of knowing (that I know of).
I'm hoping that the collective genius of SO would be able to shed some light on this!
There are a number of ways to handle this:
Change the upload process to upload the data file itself (e.g., data.txt) followed by a sentinel file (e.g., data.txt.sentinel). Then wait for the sentinel before processing the data file and deleting them both. Data files older than N days with no corresponding sentinel - just delete them. This is only good if you can change the uploader.
If you can evaluate the content of the file to check completeness, this is another way. For example, if you're only uploading HTML files, you could check that it ends with </html>. Not always possible unless you can control what's being uploaded.
The not-been-modified-for-a-while method. Basically, if the file hasn't been modified for N minutes, you can assume the upload has been finished. This may still result in the processing of incomplete files where the transfer has failed partway through.
All these methods have their advantages and drawbacks and you will have to decide which is the best for you. We try to opt for number 1 where we can influence the uploading side.
And remember that N is configurable in the above scenarios. You need to balance the possibility that a too-small N will result in you processing an incomplete file in option 3 but too large a value of N will delay the processing of said file.
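The not-been-modified-for-a-while method might look like this in Python. It is a sketch only: the function name is hypothetical, quiet_seconds plays the role of the configurable N, and the now parameter exists purely to make the function easy to test.

```python
import os
import time


def quiet_files(directory, quiet_seconds, now=None):
    """Return files in directory not modified within the last quiet_seconds seconds."""
    now = time.time() if now is None else now
    result = []
    for entry in os.scandir(directory):
        # Skip subdirectories; only regular files are candidates for processing.
        if entry.is_file() and now - entry.stat().st_mtime >= quiet_seconds:
            result.append(entry.path)
    return sorted(result)
```

As noted above, a transfer that failed partway through also stops being modified, so this check alone cannot tell a finished upload from an abandoned one.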
Is there any way you can add a step after the SFTP transfer? The idea is to SFTP the files to a temporary directory, then once that's done have the same client execute (via SSH) a script to mv the files over to the directory the cron job is looking at. mv is atomic on many local Unix filesystems, so the cron job will only either see the old file or the new one.
Of course, if you can execute a script after the SFTP transfer you can just have the script do the transfer to S3, without the cron job ;)
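The post-transfer move could also be done from a small script instead of a bare mv. A minimal sketch, assuming the staging directory and the watched directory are on the same filesystem (otherwise the rename is not atomic and may fail); the function name is hypothetical.

```python
import os


def publish(staged_path, watched_dir):
    """Move a fully uploaded file into the directory the cron job watches."""
    target = os.path.join(watched_dir, os.path.basename(staged_path))
    # os.replace performs a rename, which is atomic on POSIX within one filesystem,
    # so the cron job never sees a partially written file at the target path.
    os.replace(staged_path, target)
    return target
```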
We are using pure-ftpd for a very similar process. Rather than having a cron job do the uploads, we use the upload script option of pure-ftpd, which triggers a script every time an upload is complete. You might consider using a similar mechanism if it is available with your FTP server.
