Safely accessing inotify-watched files - shell

In a shell script I want to wait for a certain file to appear:
inotifywait -m -r -q -e close_write -e moved_to --format '%w/%f' "$ARCHIVE_ROOT" | while read -r FILE; do
    # Call other programs which process the file, like stat or sha1sum
done
I would have assumed the file to be valid while inside the handling code. Sadly, the file sometimes seems to disappear, e.g. while being processed by sha1sum.
Did I miss something obvious that is necessary to make the file last?

Many processes create temporary files and quickly delete them. Such is the nature of progress.
To prevent a file from being deleted while your loop executes, use a hard link:
inotifywait -m -r -q -e close_write -e moved_to --format '%w/%f' "$ARCHIVE_ROOT" | while read -r FILE; do
    ln "$FILE" "tmpfile$$" || continue
    sha1sum "tmpfile$$"
    rm "tmpfile$$"
done
If the file is destroyed before the ln runs, there is nothing you can do. If, however, it still exists at that point, the hard link will ensure that it continues to exist as long as you need it.
The hard link will not stop other processes from changing the file. If that is a problem, then a copy (cp) is necessary.
For this to work, the temporary file has to be on the same filesystem as the original.
If this application is security-sensitive, you may want to use mktemp or other such utility to generate the name for the temporary file.
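For example, a minimal sketch combining the hard link with mktemp (the template name is my own; mktemp -u only generates a name, and ln fails rather than overwrites if that name is somehow taken in the meantime):
inotifywait -m -r -q -e close_write -e moved_to --format '%w/%f' "$ARCHIVE_ROOT" | while read -r FILE; do
    # generate a name on the same filesystem without creating the file (hypothetical template)
    TMP=$(mktemp -u --tmpdir="$ARCHIVE_ROOT" keepalive.XXXXXX) || continue
    ln "$FILE" "$TMP" || continue    # pin the inode so the data survives deletion of "$FILE"
    sha1sum "$TMP"
    rm "$TMP"
done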

Related

Monitor Pre-existing and new files in a directory with bash

I have a script using inotify-tools.
This script notifies when a new file arrives in a folder. It performs some work with the file, and when done it moves the file to another folder. (It looks something along these lines:)
inotifywait -m -e modify "${path}" |
while read NEWFILE
do
    work on/with NEWFILE
    move NEWFILE to a new directory
done
By using inotifywait, one can only monitor new files. A similar procedure using for OLDFILE in "${path}"/* instead of inotifywait will work for existing files:
for OLDFILE in "${path}"/*
do
    work on/with OLDFILE
    move OLDFILE to a new directory
done
I tried combining the two loops by first running the second loop. But if files arrive quickly and in large numbers, there is a chance that files will arrive while the second loop is running. Those files will then be captured by neither loop.
Given that files already exist in a folder, and that new files will arrive quickly inside the folder, how can one make sure that the script will catch all files?
Once inotifywait is up and waiting, it will print the message Watches established. to standard error. So you need to go through existing files after that point.
So, one approach is to write something that will process standard error, and when it sees that message, lists all the existing files. You can wrap that functionality in a function for convenience:
function list-existing-and-follow-modify() {
    local path="$1"
    inotifywait --monitor \
                --event modify \
                --format %f \
                -- \
                "$path" \
        2> >( while IFS= read -r line ; do
                  printf '%s\n' "$line" >&2
                  if [[ "$line" = 'Watches established.' ]] ; then
                      for file in "$path"/* ; do
                          if [[ -e "$file" ]] ; then
                              basename "$file"
                          fi
                      done
                      break
                  fi
              done
              cat >&2
            )
}
and then write:
list-existing-and-follow-modify "$path" \
  | while IFS= read -r file ; do
        # ... work on/with "$file"
        # move "$file" to a new directory
    done
Notes:
If you're not familiar with the >(...) notation that I used, it's called "process substitution"; see https://www.gnu.org/software/bash/manual/bash.html#Process-Substitution for details.
The above will now have the opposite race condition from your original one: if a file is created shortly after inotifywait starts up, then list-existing-and-follow-modify may list it twice. But you can easily handle that inside your while-loop by using if [[ -e "$file" ]] to make sure the file still exists before you operate on it (see the sketch after these notes).
I'm a bit skeptical that your inotifywait options are really quite what you want; modify, in particular, seems like the wrong event. But I'm sure you can adjust them as needed. The only changes I've made above, other than switching to long options for clarity/explicitness and adding -- for robustness, are to add --format %f so that you get the filenames without extraneous details.
There doesn't seem to be any way to tell inotifywait to use a separator other than newlines, so, I just rolled with that. Make sure to avoid filenames that include newlines.
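For example, a minimal sketch of the consuming loop with the existence check from the note above (the work/move steps are placeholders carried over from the question):
list-existing-and-follow-modify "$path" \
  | while IFS= read -r file ; do
        [[ -e "$path/$file" ]] || continue   # skip names listed twice or already gone
        # ... work on/with "$path/$file"
        # ... move "$path/$file" to a new directory
    done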
By using inotifywait, one can only monitor new files.
I would ask for a definition of a "new file". The man page for inotifywait specifies a list of events, which also includes events like create, delete, and delete_self, and inotifywait can also watch "old files" (defined as files existing prior to inotifywait's execution) and directories. You specified only a single event, -e modify, which notifies about modification of files within ${path}; that includes modification of both preexisting files and files created after inotifywait's execution.
... how can one make sure that the script will catch all files?
Your script is just enough to catch all the events that happen inside the path. If you have no means of synchronization between the part that generates files and the part that receives them, there is nothing you can do and there will always be a race condition. What if your script receives 0% of CPU time and the part that generates the files gets 100%? There is no guarantee of CPU time between processes (unless you use a certified real-time system...). Implement a synchronization between them.
You can watch some other event. If the generating side closes files when it is done with them, watch for the close event. You could also run work on/with NEWFILE in parallel in the background to speed up execution and reading of new files. But if the receiving side is slower than the sending side, i.e. your script works on NEWFILEs slower than new files are generated, there is nothing you can do...
If you have no special characters and spaces in filenames, I would go with:
inotifywait -m -e modify "${path}" |
while IFS=' ' read -r dir event file ; do
    lock "${dir}"
    work on "${dir}/${file}"
    # e.g. mv "${dir}/${file}" "${new_location}"
    unlock "${dir}"
done
where lock and unlock are some locking mechanism implemented between your script and the generating part. You can create a communication channel between the process that creates the files and the process that works on them.
I think you could use some transactional file system that would let you "lock" a directory from the other scripts until you are done working on it, but I have no experience in that field.
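As an illustration, a minimal sketch of lock/unlock built on flock(1); the lock-file name and the use of file descriptor 9 are arbitrary choices of mine, and the file-generating side would have to take the same lock:
lock() {
    # open (or create) a lock file in the given directory and take an exclusive lock
    exec 9>"${1}/.lock"
    flock --exclusive 9
}
unlock() {
    flock --unlock 9
    exec 9>&-    # close the descriptor
}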
I tried combining the two loops. But if files arrive quickly and in large numbers there is a chance that the files will arrive while the second loop is running.
Run the process_new_files_loop in the background prior to running the process_old_files_loop. Also, it would be nice to make sure (i.e. synchronize) that inotifywait has successfully started before you continue to the processing-existing-files loop, so that there is no race condition between them either.
Maybe a simple example and/or starting point would be:
work() {
    local file="$1"
    some work "$file"    # placeholder for the real processing
    mv "$file" "$predefined_path"
}

process_new_files_loop() {
    # let's work on modified files in parallel, so that it is faster
    trap 'wait' INT
    inotifywait -m -e modify "${path}" |
    while IFS=' ' read -r dir event file ; do
        work "${dir}/${file}" &
    done
}

process_old_files_loop() {
    # maybe we should process in parallel here too?
    # maybe export -f work; find "${path}" -type f | xargs -P0 -n1 -- bash -c 'work "$1"' -- ?
    find "${path}" -type f |
    while IFS= read -r file; do
        work "${file}"
    done
}

process_new_files_loop &
child=$!
sleep 1
if ! ps -p "$child" >/dev/null 2>&1; then
    echo "ERROR running process_new_files_loop" >&2
    exit 1
fi
process_old_files_loop
wait # wait for process_new_files_loop
If you really care about execution speed and want it faster, switch to Python or C (or to anything but shell). Bash is not fast; it is a shell, and should be used to interconnect two processes (passing the stdout of one to the stdin of another). Parsing a stream line by line with while IFS= read -r line is extremely slow in bash and should generally be used as a last resort. Maybe using xargs, like xargs -P0 -n1 sh -c 'work on "$1"; mv "$1" "$path"' --, or parallel would be a means to speed things up, but an average Python or C program will probably be many times faster.
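For example, a minimal sketch of parallelizing the existing-files pass with xargs (assuming GNU find/xargs and coreutils nproc; export -f is a bashism, which is why bash -c is used for the children):
export -f work    # make the shell function visible to the bash -c children
find "${path}" -type f -print0 |
    xargs -0 -P "$(nproc)" -n 1 bash -c 'work "$1"' --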
A simpler solution is to add an ls in front of the inotifywait in a subshell, with awk formatting the output to look like inotifywait's.
I use this to detect and process existing and new files:
(ls ${path} | awk '{print "'${path}' EXISTS "$1}' && inotifywait -m ${path} -e close_write -e moved_to) |
while read dir action file; do
    echo $action $dir $file
    # DO MY PROCESSING
done
So it runs the ls, formats the output, and sends it to stdout, then runs the inotifywait in the same subshell, sending its output to stdout as well for processing.
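A sketch of dispatching on the synthetic EXISTS action inside that loop (the other case labels are events as inotifywait prints them; adjust them to the events you actually watch):
while read dir action file; do
    case "$action" in
        EXISTS)                 echo "pre-existing: $dir/$file" ;;
        CLOSE_WRITE*|MOVED_TO)  echo "new arrival:  $dir/$file" ;;
    esac
    # DO MY PROCESSING
done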

inotifywait triggering event twice while converting docx to PDF

I have a shell script with inotifywait set up as under:
inotifywait -r -e close_write,moved_to -m "<path>/upload" --format '%f######%e######%w'
There are some docx files residing in the watched directory, and some script converts docx to PDF via the below command:
soffice --headless --convert-to pdf:writer_pdf_Export <path>/upload/somedoc.docx --outdir <path>/upload/
Somehow the event is triggered twice as soon as the PDF is generated. The entries are as under:
somedoc.pdf######CLOSE_WRITE,CLOSE######<path>/upload/
somedoc.pdf######CLOSE_WRITE,CLOSE######<path>/upload/
What else is wrong here?
Regards
It's triggered twice because this is how soffice appears to behave internally.
One day it may start writing the file 10 times, with a sleep 2 between writes, during a single run; our program can't, and I believe shouldn't, anticipate that or depend on it.
So I'd try solving the problem from a different angle: let's just put the converted file into a temporary directory and then move it to the target dir, like this:
soffice --headless --convert-to pdf:writer_pdf_Export <path>/upload/somedoc.docx --outdir <path>/tempdir/ && mv <path>/tempdir/somedoc.pdf <path>/upload/
and use inotifywait in the following way:
inotifywait -r -e moved_to -m "<path>/upload" --format '%f######%e######%w'
The advantage is that you no longer depend on soffice's internal logic.
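Put together, a minimal sketch of the convert-then-move step (keeping the question's <path> placeholders; the temp dir must be on the same filesystem as the upload dir so that mv is a single atomic rename, producing exactly one moved_to event):
# hypothetical wrapper around the conversion
doc="<path>/upload/somedoc.docx"
tmp="<path>/tempdir"
dst="<path>/upload"
soffice --headless --convert-to pdf:writer_pdf_Export "$doc" --outdir "$tmp" &&
    mv "$tmp/$(basename "${doc%.docx}").pdf" "$dst/"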
If you can't adjust the behavior of the script producing the pdf files, then indeed you'll need to resort to a workaround like @Tarun suggested.
I don't think you can control the external program as such. But I assume you are piping this output somewhere and processing it in some other place. In that case you can skip an event that repeats within a span of a few seconds.
So we add %T to --format and --timefmt "%s" to get the epoch time. Below is the updated command:
$ inotifywait -r -e close_write,moved_to --timefmt "%s" -m "/home/vagrant" --format '%f######%e######%w##T%T' -q | ./process.sh
test.txt######CLOSE_WRITE,CLOSE######/home/vagrant/
Skipping this event as it happened within 2 seconds. TimeDiff=2
test.txt######CLOSE_WRITE,CLOSE######/home/vagrant/
This was done by running touch test.txt multiple times every second. And as you can see, the second event was skipped. The process.sh is a simple bash script:
#!/bin/bash
LAST_EVENT=
LAST_EVENT_TIME=0
while read -r line
do
    DEL="##T"
    EVENT_TIME=$(echo "$line" | awk -v delimiter="$DEL" '{split($0,a,delimiter)} END{print a[2]}')
    EVENT=$(echo "$line" | awk -v delimiter="$DEL" '{split($0,a,delimiter)} END{print a[1]}')
    TIME_DIFF=$(( EVENT_TIME - LAST_EVENT_TIME ))
    if [[ "$EVENT" == "$LAST_EVENT" ]]; then
        if [[ $TIME_DIFF -gt 2 ]]; then
            echo "$EVENT"
            LAST_EVENT_TIME=$EVENT_TIME
        else
            echo "Skipping this event as it happened within 2 seconds. TimeDiff=$TIME_DIFF"
        fi
    else
        echo "$EVENT"
        LAST_EVENT_TIME=$EVENT_TIME
    fi
    LAST_EVENT=$EVENT
done < "${1:-/dev/stdin}"
In your actual script you would disable the echo in the if branch; it was here just for demo purposes.

Are glob expressions subject to caching? How can a refresh be forced?

My script downloads files from a web server in an infinite loop. My script calls wget to get the newest files (ones I haven't gotten before), then each new file needs to be processed. The problem is that after running wget, the files have been properly downloaded (based on an ls in a separate window), but sometimes my script (specifically, the line beginning for curFile in) sees them and sometimes it doesn't, which makes me think it is sometimes looking at an outdated cache.
while [ 5 -lt 10 ]; do
    timestamp=$(date +%s)
    wget -mbr -l0 --no-use-server-timestamps --user=username --password=password ftp://ftp.mysite.com/public_ftp/incoming/*.txt
    for curFile in ftp.mysite.com/public_ftp/incoming/*.txt; do
        curFileMtime=$(stat -c %W "$curFile")
        if ((curFileMtime > timestamp)); then
            echo "$curFile"
            cp "$curFile" CommLink/MDLFile
            cd CommLink
            SendMDLGetTab
            cd ..
        fi
    done
    sleep 120
done
The first few times through the loop this seems to work fine, then it becomes sporadic afterwards (sometimes it sees the new files and sometimes it doesn't). I've done a lot of googling, and found that bash does cache pathnames for use in running executables (so sometimes it tries to execute things that aren't there, if the executable has been recently removed) but I haven't found anything on caching non-executable filenames, which would result in it not seeing things that are there. Any ideas? If it is a caching problem, how can I force it to not look at the cache?
As the most immediate issue -- the -b argument to wget tells it to run in the background. Thus, with this flag set, the first subsequent command takes place while wget is still running.
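A minimal illustration of that first fix (assuming nothing else relies on the download running in the background): drop the b from -mbr so that wget completes before the glob below it is evaluated.
# foreground download: the for-loop only runs once wget has finished
wget -mr -l0 --no-use-server-timestamps --user=username --password=password \
    'ftp://ftp.mysite.com/public_ftp/incoming/*.txt'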
Beyond that: Results of glob expressions -- such as ftp.mysite.com/public_ftp/incoming/*.txt -- are not cached by the shell. However, this glob is only evaluated once per loop: If a new text file isn't present at the start of that loop, it won't be picked up until the next iteration.
However, the mechanism the code in the question uses for excluding files already present before wget was run is prone to race conditions. I would suggest the following instead:
while IFS= read -r -d '' filename; do
    [[ "$filename" -nt CommLink/MDLFile ]] || continue  # retest in case MDLFile has changed
    cp -- "$filename" CommLink/MDLFile && {
        touch -r "$filename" CommLink/MDLFile  # copy mtime to destination
        (cd CommLink && exec SendMDLGetTab)    # scope cd to subshell
    }
done < <(find ftp.mysite.com/public_ftp/incoming/ \
            -name '*.txt' \
            -newer CommLink/MDLFile \
            -print0)
Some of the finer points:
The code above compares timestamps to the current copy of MDLFile, rather than to the beginning of the current iteration of the loop. This is more robust in terms of ensuring that updates are processed if a prior invocation of this script was interrupted after the wget but before the cp.
Using touch -r ensures that the new copy of MDLFile retains the mtime of the original. (One might replace the cp with ln -f to hardlink the inode, getting this same effect without any race conditions and while storing MDLFile only once on disk, if the side effects are acceptable; see the sketch after these notes.)
The code above only performs operations intended to be run inside a subdirectory if the cd into that subdirectory succeeded, and scopes that cd by performing those operations in a subshell. (The cost of this subshell is offset by using exec to run the external command it ultimately exists to launch.)
Using a NUL-delimited stream, as in find -print0 | while IFS= read -r -d '' filename, ensures that all possible names -- including names with literal newlines -- can be correctly handled.
Whether timestamps are stored at or beyond integer-level resolution varies by filesystem; however, bash only supports integer math. The above -- wherein no numeric comparisons are performed by the shell -- is thus somewhat more robust.
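For instance, the ln -f variant mentioned above would look like this (a sketch; it assumes the rest of the pipeline tolerates MDLFile being a hard link to the downloaded file):
# replaces the cp/touch pair: the link shares the inode, so the mtime comes along for free
ln -f -- "$filename" CommLink/MDLFile &&
    (cd CommLink && exec SendMDLGetTab)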

Keep renaming until file doesn't exist in directory

I need to check if a file exists in a directory and, if it does, rename it by adding an extra extension, like original.txt to original.txt.txt. BUT I also need to check whether the renamed file already exists.
My code only changes the name once and does not check the rest of the directory's contents before renaming. It just overwrites all my original.txt.txt files.
Question:
How to check all files and add .txt as many times as needed so it doesn't conflict with any other name in the directory?
if [ -e "$destination_folder/$basename_file" ]
then
    cp "$file_name" "$destination_folder/$basename_file".txt
fi
If I understand you correctly, you want to rename a file only if there is no other file with the new name. That way you prevent accidental overwriting of the other file.
To achieve this you can use mv (which I assume you use for renaming) with the option -i, so it will ask before overwriting anything. If you are quite sure you do not want to rename anything in that case, you can pipe yes n into the mv command:
yes n | mv -i "$old" "$new"
To prevent seeing ugly automatically answered questions, you can redirect the stderr of the commands to /dev/null:
(yes n | mv -i "$old" "$new") 2> /dev/null
This will now rename the file only if the new name is free. Otherwise it does nothing.
EDIT:
You can use a loop to find a "free" name:
name=$original
while [ -e "$name" ]
do
    name="$name.txt"
done
mv "$original" "$name"
This way you will find a free name and then use it.
Be aware that this can still overwrite a just recently created file of the new name. Unix systems are multitasking and another process could create a file of the new name just after the loop and before the mv command.
To avoid this pathological case (in case you cannot be sure that this won't happen), you should use my method above to move only if nothing would be overwritten, and afterwards you should check if the original file (with the original file name) still exists, and if it still exists, try the whole thing again:
new=$old.txt
while true
do
    (yes n | mv -i "$old" "$new") 2> /dev/null
    if [ -e "$old" ]
    then
        new=$new.txt
    else
        break
    fi
done
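If your mv supports the -n (no-clobber) flag (GNU coreutils and the BSDs do), a sketch of the same loop without the yes n workaround:
new="$old.txt"
while [ -e "$old" ]; do
    mv -n "$old" "$new"   # silently does nothing if "$new" already exists
    new="$new.txt"
done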

Using inotify to check for directory modifications

I'm getting started with bash scripts and I am making a script that backs up a directory on any modification. Modification includes new files/folders and modifications to existing files. How would I go about doing this using inotify?
Thanks!
I think you're looking for the tool inotifywait.
For example:
inotifywait -r -m -e modify -e create <one_or_more_dirs> |
while read line
do
    echo "$line"
done
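To actually back up on each event, a minimal sketch (rsync and the WATCH_DIR/BACKUP_DIR locations are my assumptions, not part of the question):
#!/bin/bash
WATCH_DIR="/path/to/watch"     # hypothetical locations
BACKUP_DIR="/path/to/backup"

inotifywait -r -m -e modify -e create --format '%w%f' "$WATCH_DIR" |
while IFS= read -r changed; do
    # copy just the changed path into the backup tree, preserving attributes
    rsync -a --relative "$changed" "$BACKUP_DIR"/
done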
