determine if file is complete - ruby

I am trying to write a Ruby video transcoder script (using ffmpeg) that depends on .mov files being FTPed to a server.
The problem I've run into is that when a large file is uploaded by a user, the watch script (using rb-inotify) attempts to execute (and run the transcoder) before the .mov is completely uploaded.
I'm a complete noob, but I'm trying to find a way to ensure my watch script doesn't run until the file(s) is/are completely uploaded.
My watch script is here:
watch_me = INotify::Notifier.new
watch_me.watch("/directory_to_my/videos", :close_write) do |directories|
  load '/directory_to_my/videos/.transcoder.rb'
end
watch_me.run
Thank you for any help you can provide.

Just relying on inotify(7) to tell you when a file has been updated isn't a great fit for telling when an upload is 'complete' -- an FTP session might time out and be restarted, for example, allowing a user to upload a file in chunks over several days, whenever connectivity is cheap, reliable, or available. inotify(7) only ever sees file open, close, rename, and access, but never the higher-level event "I'm done modifying this file", as the user would understand it.
There are two mechanisms I can think of: one is to have uploads go initially into one directory and ask the user to move the file into another directory when the upload is complete. The other is to create some file metadata on the client and use that to "know" when the upload is complete.
Move completed files manually
If your users upload into the directory ftp/incoming/temporary/, they can upload the file in as many connections as required. Once the file is "complete", they can rename it (rename ftp/incoming/temporary/hello.mov ftp/incoming/complete/hello.mov), and your rb-inotify interface, watching for file renames in the ftp/incoming/complete/ directory, starts the ffmpeg(1) command.
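A minimal sketch of that rename-based trigger, reusing the rb-inotify watcher from the question (the directory name and the ffmpeg arguments are placeholder assumptions, not part of the original setup):
require 'rb-inotify'

watch_me = INotify::Notifier.new
# :moved_to fires when a file is renamed into the watched directory,
# which is exactly what the final rename of a finished upload does.
watch_me.watch("/ftp/incoming/complete", :moved_to) do |event|
  # event.absolute_name is the full path of the file that just arrived.
  system("ffmpeg", "-i", event.absolute_name, "#{event.absolute_name}.mp4")
end
watch_me.run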
Generate metadata
For a transfer to be "complete", you're really looking for two things:
The file is the same size on both systems.
The file is identical on both systems.
Since "identical" is otherwise difficult to check, most people content themselves with checking if the contents of the file, when run through a cryptographic hash function such as MD5 or SHA-1 (or better, SHA-224, SHA-256, SHA-384, or SHA-512) functions. MD5 is quite fine if you're guarding against incomplete transmission but if you intend on using the output of the function for other means, using a stronger function would be wise.
MD5 is really tempting though, since tools to create and validate MD5 hashes are very widespread: md5sum(1) on most Linux systems, md5(1) on most BSD systems (including OS X).
$ md5sum /etc/passwd
c271aa0e11f560af419557ef49a27ac8 /etc/passwd
$ md5sum /etc/passwd > /tmp/sums
$ md5sum -c /tmp/sums
/etc/passwd: OK
The md5sum -c command asks the md5sum(1) program to check the file of hashes and filenames for correctness. It looks a little silly when used on just a single file, but when you've got dozens or hundreds of files, it's nice to let the software do the checking for you. For example: http://releases.mozilla.org/pub/mozilla.org/firefox/releases/3.0.19-real-real/MD5SUMS -- Mozilla has published such files with 860 entries -- checking them by hand would get tiring.
Because checking hashes can take a long time (five minutes on my system to check a high-definition hour-long video that wasn't recently used), it'd be a good idea to only check the hashes when the filesizes match. Modify your upload tool to send along some metadata about how long the file is and what its cryptographic hash is. When your rb-inotify script sees file close requests, check the file size, and if the sizes match, check the cryptographic hash. If the hashes match, then start your ffmpeg(1) command.
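Here's a rough sketch of that check in Ruby, assuming (hypothetically) that the uploader also sends a hello.mov.meta file whose first line is the expected size in bytes and whose second line is the expected SHA-256 hex digest:
require 'digest'

# Hypothetical metadata layout: "<path>.meta" holds the expected byte
# count on line 1 and the expected SHA-256 hex digest on line 2.
def upload_complete?(path)
  meta = "#{path}.meta"
  return false unless File.exist?(meta)

  expected_size, expected_digest = File.readlines(meta).map(&:strip)
  # Cheap check first: only hash the file once the sizes match.
  return false unless File.size(path) == expected_size.to_i

  Digest::SHA256.file(path).hexdigest == expected_digest
end

# Inside the :close_write handler:
#   start_ffmpeg(path) if upload_complete?(path)   # start_ffmpeg is a placeholder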

It seems easier to upload the file to a temporary directory on the server and move it to the location your script is watching once the transfer is completed.

Related

Transferring (stopping, resuming) file using rsync

I have an external hard-drive that I suspect is on its way out. At the minute, I can transfer files from it, but only for a while. Unfortunately, I have one single file that's >50GB in size. My solution to this is to use rsync to transfer this one particular file a bit at a time, leave the drive to rest (switch it off), and resume a little while later.
I'm using rsync --partial --progress --inplace --append -a /Volumes/Backup\ Drive/chris/Desktop/Recording\ Sessions/S1/Session\ 1/untitled ~/Desktop/temp to transfer it. (The file is in the untitled folder, which I'm moving into the temp folder) However, after having stopped it and resumed it, it seems to be over-writing the previous attempt at the file, meaning I don't really get any further.
Is there something I'm missing? :X
Thank you ^_^
EDIT: Still don't know :\
Well, since this is a programming site, here's a program to do it. I tested it on OS X, but you should definitely test it on some small files first to make sure it does what you want:
#!/usr/bin/env python
import os
import sys

source = sys.argv[1]        # file to copy from
target = sys.argv[2]        # file to copy to
begin = int(sys.argv[3])    # byte offset to start at (inclusive)
end = int(sys.argv[4])      # byte offset to stop at (exclusive)

# Append into an existing target, or create it on the first run.
mode = 'r+b' if os.path.exists(target) else 'w+b'

with open(source, 'rb') as source_file, open(target, mode) as target_file:
    source_file.seek(begin)
    target_file.seek(begin)
    buffer = source_file.read(end - begin)
    target_file.write(buffer)
You run this with four arguments: the source file, the destination, and two numbers. The first number is the byte count to start copying from (so on the first run you'd use 0). The second number is the byte count to copy until (not including). So on subsequent runs you'd always use the previous fourth argument as the new third argument (new begin equals old end). And just go on like that until it's done, using whatever sizes you like along the way.
I know this is related to macOS, but the best way to get all the files off a dying drive is with GNU ddrescue. I have no idea if this runs nicely on macOS, but you can always use a Linux live-usb to do this. You'll want to open a terminal and be either root (preferred) or use sudo.
Firstly, find the disk that you want to backup. This can be done by running the following. Make note of the partition name or disk name that you want to back up. Hard drives/flash drives will typically use the format sdX, where X is the drive letter. Partitions will be listed under sdX1, sdX2... etc. NVMe drives/partitions follow a similar naming convention.
lsblk -o name,size,label,fstype,model
Mount and change directory (cd) to a writable location that is bigger than the drive/partition you want to back up.
Now we are going to do a first pass over the drive/partition, without stopping on problematic sections. This ensures that ddrescue does not cause any more damage by repeatedly trying to access a bad section. Think of it like a hole in a sweater -- you wouldn't want to keep picking at the hole or it would get bigger. Run the following, with sdX replaced by the drive/partition name from earlier:
ddrescue -d /dev/sdX backup.img backup.logfile
The -d flag uses direct disk access and ignores the kernel cache, and the logfile is important in case the drive gets disconnected or the process stops somehow.
Run ddrescue again with the -r flag. This will retry bad sections 3 times. Feel free to run this a few times, but note that ddrescue cannot restore everything. From my experience it usually restores in the high 90%s, and many of the files are system files (aka not your personal files).
ddrescue -d -r3 /dev/sdX backup.img backup.logfile
Finally, you can use the image however you want. You can either mount it to copy the files off or use it in a virtual machine/burn it to a working drive with dd. Do note that the latter options will not always work if system critical files were damaged.
Good luck and remember to make backups!

How to check if file is being copied under macOS

Users on our network copy files to a directory called "DropBox" on the server over an AFP connection, simply by dragging them in the Finder.
A script running on the server checks periodically for the presence of new files inside "DropBox" and then moves them with mv into other directories.
How can the script check if a file is being copied (and wait for the process to complete before moving it away)?
I've tried fuser filename with no success. If the file copy is issued by a remote machine, fuser reports that no process is using the file.
There are two approaches to this problem:
1. Flag the filename to show that the file is partial:
When a file is being written (moved/copied) into the DropBox, a .part extension is added to its name. When the writing is over, the .part extension is removed.
Other processes operate on files only if the .part extension is not present.
2. Check the file size; if it stays constant, then the write operation has ended (maybe)
This is less accurate and more complicated than (1), but offers a good guess if option 1 is not viable for whatever reason.
For each file in the DropBox, record when its size last changed.
When a file grows, reset that timestamp.
When a file is no longer being written to, its size stays constant and the elapsed time keeps accumulating.
If a file hasn't grown for at least some threshold (30 seconds, 1 minute -- the more time you can afford, the better, of course), assume the write operations are over and the file can be managed (a sketch of this polling loop follows).
This approach fails when a transfer is interrupted: the size stops changing even though the file is incomplete.
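A rough sketch of that polling loop in Ruby (the DropBox path, the destination, and the 30-second threshold are placeholders):
DROPBOX     = "/Volumes/Server/DropBox"   # placeholder path
DESTINATION = "/path/to/processing"       # placeholder path
QUIET_FOR   = 30                          # seconds the size must stay unchanged

last_change = {}   # path => [last seen size, time the size last changed]

loop do
  Dir.glob(File.join(DROPBOX, "*")).each do |path|
    next unless File.file?(path)
    size = File.size(path)
    if last_change[path].nil? || last_change[path][0] != size
      # New file, or still growing: (re)start the clock.
      last_change[path] = [size, Time.now]
    elsif Time.now - last_change[path][1] >= QUIET_FOR
      # Size has been stable long enough; hand the file off.
      system("mv", path, DESTINATION)
      last_change.delete(path)
    end
  end
  sleep 5
end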

How to programmatically find the difference between two directories

First off: I am not necessarily looking for Delphi code; spit it out any way you want.
I've been searching around (especially here) and found a bit about people looking for ways to compare two directories (including subdirectories), though they were using byte-by-byte methods. Second off, I am not looking for a difftool; I am "just" looking for a way to find files which do not match and, just as important, files which are in one directory but not the other and vice versa.
To be more specific: I have one directory (the backup folder) which I constantly update using FindFirstChangeNotification. However, the first time I need to copy all files, and I also need to check the backup directory against the original when the application starts (in case something happened while the application wasn't running or FindFirstChangeNotification didn't catch a file change). To solve this I am thinking of creating a CRC list for the backed-up files, then running through the original directory computing the CRC for every file, and finally comparing the two CRCs. Then somehow look for files which are in one directory and not the other (again, vice versa).
Here's the question: Is this the fastest way? If so, how would one (roughly) get the job done?
You don't necessarily need CRCs for each file, you can just compare the "last modified" date for every file for most normal purposes. It's WAY faster. If you need additional safety, you can also compare the lengths. You get both of these metrics for free with the find functions.
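Since you're not tied to Delphi code, here is a rough language-agnostic illustration of that idea in Ruby (the two directory paths are placeholders): take a size/mtime snapshot of each tree, then diff the two maps to find changed files and files present on only one side.
require 'find'
require 'pathname'

# Map each file's path (relative to root) to its [size, mtime].
def snapshot(root)
  base = Pathname.new(root)
  files = {}
  Find.find(root) do |path|
    next unless File.file?(path)
    rel = Pathname.new(path).relative_path_from(base).to_s
    files[rel] = [File.size(path), File.mtime(path)]
  end
  files
end

original = snapshot("/path/to/original")   # placeholder
backup   = snapshot("/path/to/backup")     # placeholder

missing_from_backup   = original.keys - backup.keys
missing_from_original = backup.keys - original.keys
changed = (original.keys & backup.keys).reject { |k| original[k] == backup[k] }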
And in your change notification, you should probably add the files to a queue and use a timer object to copy the new queued files every ~30sec or something, so you don't bog down the system with frequent updates/checks.
For additional speed, use the Win32 functions wherever possible, avoid any Delphi find/copy/getfileinfo functions. I'm not familiar with the Delphi framework but for example the C# stuff is WAY WAY WAY slower than the Win32 functions.
Regardless of your "not looking for a difftool", are you opposed to using Cygwin with its "diff" command from the shell? If you are open to this it's quite easy, particularly using diff with the -r "recursive" option.
The following generates the differences between 2 Rails installs on my machine; the grep keeps both the lines describing files that differ and, via 'Only', the files that exist in one directory but not the other:
$ diff -r pgnindex pgnonrails | egrep '^Only|diff'
Only in pgnindex/app/controllers: openings_controller.rb
Only in pgnindex/app/helpers: openings_helper.rb
Only in pgnindex/app/views: openings
diff -r pgnindex/config/environment.rb pgnonrails/config/environment.rb
diff -r pgnindex/config/initializers/session_store.rb pgnonrails/config/initializers/session_store.rb
diff -r pgnindex/log/development.log pgnonrails/log/development.log
Only in pgnindex/test/functional: openings_controller_test.rb
Only in pgnindex/test/unit: helpers
The fastest way to compare one directory on the local machine to a directory on another machine thousands of miles away is exactly as you propose (see the sketch after this list):
generate a CRC/checksum for every file
send the name, path, and CRC/checksum for each file over the internet to the other machine
compare
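A minimal Ruby sketch of the first two steps (the directory path is a placeholder); the resulting manifest of digest/path lines could be shipped to the other machine and compared line by line:
require 'digest'
require 'find'

root = "/path/to/directory"   # placeholder

# Print "digest  relative/path" for every file, much like sha256sum(1) output.
Find.find(root) do |path|
  next unless File.file?(path)
  relative = path.sub(%r{\A#{Regexp.escape(root)}/}, "")
  puts "#{Digest::SHA256.file(path).hexdigest}  #{relative}"
end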
Perhaps the easiest way to do that is to use rsync with the "--dry-run" or "--list-only" option.
(Or use one of the many applications that use the rsync algorithm, or compile the rsync algorithm into your application.)
cd some_backup_directory
rsync --dry-run myname@remote_host:latest_version_directory .
For speed, the default rsync assumes, as Blindy suggested, that two files with the same name and the same path and the same length and the same modification time are the same.
For extra safety, you can give rsync the "--checksum" option to ignore the length and modification time and force it to compare (the checksum of) the actual contents of the file.

Verify whether ftp is complete or not?

I've got an application which polls a folder continuously. Once any file is FTPed to the folder, the application has to move this file to some other folder for processing.
Here, we don't have any option to verify whether the FTP transfer is complete or not.
One command, "lsof", was suggested in the technical forums. It has a file descriptor column which gives the file's status.
Since this is a FreeBSD command and not present in old versions of Linux, I want to clarify the usage of this command.
Can you guys tell us your experience with file verification, and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old and doesn't have a corresponding sentinel file should be deleted - it was a failed transmission (a small sketch of this scheme follows).
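A rough sketch of this sentinel scheme in Ruby (the incoming directory and the process_file handler are placeholders for whatever your listener actually does):
INCOMING = "/ftp/incoming"   # placeholder path
MAX_AGE  = 24 * 60 * 60      # one day, in seconds

# Process every data file whose sentinel has arrived, then delete both.
Dir.glob(File.join(INCOMING, "*.sentinel")).each do |sentinel|
  data_file = sentinel.sub(/\.sentinel\z/, "")
  next unless File.exist?(data_file)
  process_file(data_file)              # placeholder for the real processing
  File.delete(data_file, sentinel)
end

# Clean up failed transmissions: day-old data files with no sentinel.
Dir.glob(File.join(INCOMING, "*")).each do |path|
  next if path.end_with?(".sentinel")
  next if File.exist?("#{path}.sentinel")
  File.delete(path) if Time.now - File.mtime(path) > MAX_AGE
end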
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously truncated? Parsing an incomplete XML file, for example, will produce a parse error because the end element is missing. Depending on your file's size and format, this can be trivial or very time-consuming (see the sketch after this list).
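For example, if the uploads happened to be XML, a completeness probe could be as small as this Ruby sketch (using the standard-library REXML parser; the path is a placeholder):
require 'rexml/document'

# A truncated XML file is missing its closing element and fails to parse.
def looks_complete?(path)
  REXML::Document.new(File.read(path))
  true
rescue REXML::ParseException
  false
end

puts looks_complete?("/ftp/incoming/order.xml")   # placeholder path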
lsof would possibly be an option, although you've identified your Linux portability issue. If you use this, note the -F option, which formats the output suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check whether the FTP session is still active. This will work if every peer has its own FTP user account. As long as the user has not logged off from FTP, assume the files are not complete.

Is file still being uploaded?

I have an app that I'm writing that takes files in a specific directory that have been uploaded via SFTP and moves them to S3.
I have a problem where my cron job starts uploading a file to S3 before it has been completely uploaded via SFTP. I have thought of every way to try to wait until the file is complete, but I have no way of knowing when that is (that I know of).
I'm hoping that the collective genius of SO would be able to shed some light on this!
There are a number of ways to handle this:
Change the upload process to upload the data file itself (e.g., data.txt) followed by a sentinel file (e.g., data.txt.sentinel). Then wait for the sentinel before processing the data file and deleting them both. Data files older than N days with no corresponding sentinel - just delete them. This is only good if you can change the uploader.
If you can evaluate the content of the file to check completeness, this is another way. For example, if you're only uploading HTML files, you could check that it ends with </html>. Not always possible unless you can control what's being uploaded.
The not-been-modified-for-a-while method. Basically, if the file hasn't been modified for N minutes, you can assume the upload has been finished. This may still result in the processing of incomplete files where the transfer has failed partway through.
All these methods have their advantages and drawbacks and you will have to decide which is the best for you. We try to opt for number 1 where we can influence the uploading side.
And remember that N is configurable in the above scenarios. You need to balance the possibility that too small an N will result in processing an incomplete file (option 3) against the fact that too large an N will delay processing of that file.
Is there any way you can add a step after the SFTP transfer? The idea is to SFTP the files to a temporary directory, then once that's done have the same client execute (via SSH) a script to mv the files over to the directory the cron job is looking at. mv is atomic on many local Unix filesystems, so the cron job will only either see the old file or the new one.
Of course, if you can execute a script after the SFTP transfer you can just have the script do the transfer to S3, without the cron job ;)
We are using pure-ftpd for a very similar process. Rather than having a cron job poll for uploads, we use the upload script option of pure-ftpd, which triggers a script every time an upload is complete. You might consider using a similar mechanism if it is available with your FTP server.
