rsync - How to copy a certain number of files, pause, repeat - bash

I have a situation which I have failed to find a solution for.
I have a process which generates ~10,000 xml files into one directory. Those files get rsync'd (with a delete on the source once copied) to a server which runs a process every 5 minutes to import them. The problem is that the volume of files is such that it takes longer than 5 minutes to process them and I can't change that timing. What I would like to do is come up with a script which would allow me to rsync the first 2500 files in the directory, wait 5 minutes, rsync the next 2500, etc. The numbers of files vary, so I'd want it to just keep going through until all the files have been copied. The order of the files doesn't matter, they could be listed alphabetically or by date or just random. Does anyone have any examples of how to do this?
Thanks!

If I understood your problem correctly, you need something like:
while true; do
ls | shuf -n 2500 > /tmp/sync_files # pick random files
rsync -av $(cat /tmp/sync_files) /destination/ && # sync the files
xargs rm < /tmp/sync_files # delete the files once the sync has succeeded
sleep 300; # sleep 5 minutes
done
The code picks random files, syncs them to another directory, then removes them (if the file names contain spaces or other unusual characters, this should be done with a for loop and rm instead), and finally sleeps for 5 minutes. Let me know if I got your problem right.

Randomness is optional, and we want to stop once all the files have been transmitted. Parsing the output of ls sometimes gives strange results, so that would make it:
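#!/bin/bash
qty=2500       # files per batch
sleeptime=300  # seconds to wait between batches
typeset -i i
i=0
for f in * ; do
    rsync -av "$f" /destination/ && rm -- "$f" # remove only after a successful copy
    i=$((i + 1))
    if [ "$i" -eq "$qty" ] ; then
        sleep "$sleeptime"
        i=0
    fi
done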
But then you do an rsync per file, which may also not be what you want.
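To keep the batching but avoid one rsync per file, the file list can be cut into slices and each slice handed to a single rsync call. This is only a sketch under assumptions: it expects an rsync that supports --files-from, --from0 and --remove-source-files, and send_batch, copy_in_batches and DEST are illustrative names, not part of the answers above.

```shell
#!/bin/bash
# Hypothetical helper: transfer one batch of files with a single rsync call.
# --from0 reads NUL-separated names from stdin, so unusual file names are safe;
# --remove-source-files deletes each file after it has been transferred.
send_batch() {
    printf '%s\0' "$@" |
        rsync -a --from0 --files-from=- --remove-source-files . "$DEST"
}

# Copy everything in the current directory in batches of $1 files,
# sleeping $2 seconds between batches.
copy_in_batches() {
    local qty=$1 sleeptime=$2 start
    local files=(*)                      # snapshot of the directory contents
    [ -e "${files[0]}" ] || return 0     # nothing to copy
    for ((start = 0; start < ${#files[@]}; start += qty)); do
        send_batch "${files[@]:start:qty}" || return 1
        # Pause only if another batch remains.
        if (( start + qty < ${#files[@]} )); then
            sleep "$sleeptime"
        fi
    done
}

# Usage (run from the source directory):
#   DEST=/destination/ copy_in_batches 2500 300
```

Files that arrive while the script is running are not in the snapshot; running it again (for example from cron) picks them up.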

Related

Bash script to check if a new file has been created on a directory after run a command

Using a bash script, I'm trying to detect whether a file has been created in a directory while commands are running. Let me illustrate the problem:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# get list of files on that directory
FILES_BEFORE= ls $WATCH_DIR
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# and I'm getting new list of files.
FILES_AFTER= ls $WATCH_DIR
# detect changes and if any changes has been occurred exit the program.
After that I tried to compare FILES_BEFORE and FILES_AFTER, but couldn't get it to work. I've tried:
comm -23 <($FILES_AFTER |sort) <($FILES_BEFORE|sort)
diff $FILES_AFTER $FILES_BEFORE > /dev/null 2>&1
cat $FILES_AFTER $FILES_BEFORE | sort | uniq -u
None of them gave me a result that tells me whether there was a change. What I need is to detect the change and exit the program if there is one. I am not very good at bash scripting and have searched a lot on the internet, but couldn't find what I need. Any help will be appreciated. Thanks.
Thanks to the informative comments, I realized that I had missed some bash basics, but I finally made it work. I'll leave my solution here as an answer for those who struggle like me:
WATCH_DIR=./tmp
FILES_BEFORE=$(ls "$WATCH_DIR")
echo > "$WATCH_DIR/filename"
FILES_AFTER=$(ls "$WATCH_DIR")
if diff <(echo "$FILES_AFTER") <(echo "$FILES_BEFORE")
then
    echo "No changes"
else
    echo "Changes"
fi
It outputs "Changes" on the first run and "No changes" on later runs, unless you delete the newly added file.
I'm trying to interpret your script (which contains some errors) to work out your requirements.
I think the simplest way is to redirect the output of the ls command to named files, then diff those files:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# get list of files on that directory
ls $WATCH_DIR > /tmp/watch_dir.before
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# and I'm getting new list of files.
ls $WATCH_DIR > /tmp/watch_dir.after
# detect changes and if any changes has been occurred exit the program.
diff -c /tmp/watch_dir.after /tmp/watch_dir.before
If any files are modified by the 'commands', i.e. the files exist in the 'before' list but their contents change, the above will not show that as a difference.
In this case you might be better off creating a 'marker' file to record the instant the monitoring started, then using the find command to list any files that are newer than, or were modified since, the marker file. Something like this:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# save the file list; this file also serves as the marker for when monitoring started
ls "$WATCH_DIR" > /tmp/watch_dir.before
# actually a command is running here but lets assume I've created a new file there.
echo > "$WATCH_DIR/filename"
# list the files created or modified since the marker file was written.
find "$WATCH_DIR" -type f -newer /tmp/watch_dir.before -exec ls -l {} \;
What this won't do is show any files that were deleted, so perhaps a hybrid list could be used.
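One way to build such a hybrid list is to use find -newer for new or modified files and a comparison of the two listings for deleted ones. The following is a minimal sketch, not a tested tool: watch_start, watch_report and STATE_DIR are hypothetical names, and it assumes file names without newlines.

```shell
#!/bin/bash
WATCH_DIR=${WATCH_DIR:-./tmp}
STATE_DIR=$(mktemp -d)      # holds the marker file and the 'before' listing

# Call this before running the commands.
watch_start() {
    ls -A "$WATCH_DIR" | sort > "$STATE_DIR/before"
    touch "$STATE_DIR/marker"
}

# Call this afterwards.
watch_report() {
    # Files created or modified since the marker was written.
    find "$WATCH_DIR" -type f -newer "$STATE_DIR/marker" |
        sed 's/^/new or modified: /'
    # Names that were present before but are missing now.
    ls -A "$WATCH_DIR" | sort |
        comm -23 "$STATE_DIR/before" - |
        sed 's/^/deleted: /'
}
```

A change made within the same timestamp tick as the marker may be missed, since -newer needs a strictly later modification time.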
Here is how I got it to work. It's also set up so that you can watch multiple directories with the same script from cron.
For example, if you wanted one to run every minute:
* * * * * /usr/local/bin/watchdir.sh /makepdf
and one every hour:
0 * * * * /usr/local/bin/watchdir.sh /incoming
#!/bin/bash
WATCHDIR="$1"
NEWFILESNAME=.newfiles$(basename "$WATCHDIR")
# first run: record the current contents as the baseline
if [ ! -f "$WATCHDIR"/.oldfiles ]
then
    ls -A "$WATCHDIR" > "$WATCHDIR"/.oldfiles
fi
ls -A "$WATCHDIR" > "$NEWFILESNAME"
# keep the second space-separated field of each diff line, i.e. the file name
DIRDIFF=$(diff "$WATCHDIR"/.oldfiles "$NEWFILESNAME" | cut -f 2 -d " ")
for file in $DIRDIFF
do
    if [ -e "$WATCHDIR/$file" ]; then
        # do what you want to the file(s) here
        echo "$file"
    fi
done
rm "$NEWFILESNAME"

Bash diff that stops when it finds the first difference

I have this script that I use for backups. The problem is that it is kind of slow. I want to know if there is a diff command that stops when it finds the first difference.
DocumentsFiles=("Books" "Comics" "Distros" "Emulators" "Facturas" "Facultad" "Laboral" "Mods" "Music" "Paintings" "Projects" "Scripts" "Tesis" "Torrents" "Utilities")
OriginDocumentsFile="E:\Documents\\"
DestinationDocumentsFile="F:\Files\Documents\\"
## loop file to file and copy in backup
for directory in "${DocumentsFiles[@]}"
do
RealOrigin="${OriginDocumentsFile}${directory}"
RealDestination="${DestinationDocumentsFile}${directory}"
echo $directory
if [ -a "$RealDestination" ]; then
echo ok
if diff -r $RealOrigin $RealDestination; then
echo "${directory} are equal!"
else
rm -rfv $RealDestination
cp -ruv $RealOrigin "${DestinationDocumentsFile}"
fi
else
cp -ruv $RealOrigin "${DestinationDocumentsFile}"
fi
done
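diff -q reports "only when files differ" (per man diff), so I believe it stops comparing a pair of files at the first difference. With -r it still visits the remaining files in the tree, but it only names those that differ instead of printing every changed line.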
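As a rough illustration of using diff -rq purely as a yes/no test, the check can be wrapped in a small function. same_tree is a hypothetical name, and the usage lines reuse the variable names from the question.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both directory trees have identical contents.
# All output is discarded; only the exit status is used.
same_tree() {
    diff -rq -- "$1" "$2" > /dev/null 2>&1
}

# Usage inside the backup loop:
#   if same_tree "$RealOrigin" "$RealDestination"; then
#       echo "${directory} are equal!"
#   fi
```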
But this is a bit of an XY problem. Really you need a better backup program like rsync:
It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination.
From man rsync

Monitor Pre-existing and new files in a directory with bash

I have a script using inotify-tool.
This script is notified when a new file arrives in a folder. It performs some work with the file and, when done, moves the file to another folder. (It looks something along these lines):
inotifywait -m -e modify "${path}" |
while read NEWFILE
do
work on/with NEWFILE
move NEWFILE to a new directory
done
By using inotifywait, one can only monitor new files. A similar procedure using for OLDFILE in path instead of inotifywait will work for existing files:
for OLDFILE in "${path}"/*
do
work on/with OLDFILE
move OLDFILE to a new directory
done
I tried combining the two loops by running the second loop first. But if files arrive quickly and in large numbers, there is a chance that files will arrive while the second loop is running. These files will then not be captured by either loop.
Given that files already exist in a folder, and that new files will arrive quickly in that folder, how can one make sure that the script will catch all files?
Once inotifywait is up and waiting, it will print the message Watches established. to standard error. So you need to go through existing files after that point.
So, one approach is to write something that will process standard error, and when it sees that message, lists all the existing files. You can wrap that functionality in a function for convenience:
function list-existing-and-follow-modify() {
local path="$1"
inotifywait --monitor \
--event modify \
--format %f \
-- \
"$path" \
2> >( while IFS= read -r line ; do
printf '%s\n' "$line" >&2
if [[ "$line" = 'Watches established.' ]] ; then
for file in "$path"/* ; do
if [[ -e "$file" ]] ; then
basename "$file"
fi
done
break
fi
done
cat >&2
)
}
and then write:
list-existing-and-follow-modify "$path" \
| while IFS= read -r file
do
    # ... work on/with "$file"
    # move "$file" to a new directory
    : # placeholder so the loop body is not empty
done
Notes:
If you're not familiar with the >(...) notation that I used, it's called "process substitution"; see https://www.gnu.org/software/bash/manual/bash.html#Process-Substitution for details.
The above will now have the opposite race condition from your original one: if a file is created shortly after inotifywait starts up, then list-existing-and-follow-modify may list it twice. But you can easily handle that inside your while-loop by using if [[ -e "$file" ]] to make sure the file still exists before you operate on it.
I'm a bit skeptical that your inotifywait options are really quite what you want; modify, in particular, seems like the wrong event. But I'm sure you can adjust them as needed. The only change I've made above, other than switching to long options for clarity/explicitness and adding -- for robustness, is to add --format %f so that you get the filenames without extraneous details.
There doesn't seem to be any way to tell inotifywait to use a separator other than newlines, so, I just rolled with that. Make sure to avoid filenames that include newlines.
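The duplicate-handling idea from the notes above could look roughly like this. process_stream and its two directory arguments are assumptions made for illustration; the mv stands in for whatever work is really done on each file.

```shell
#!/bin/bash
# Hypothetical consumer: reads one file name per line from stdin and moves
# each file from directory $1 to directory $2.
process_stream() {
    local dir=$1 dest=$2 file
    while IFS= read -r file; do
        # A name that was listed twice has already been moved; skip it.
        [[ -e "$dir/$file" ]] || continue
        mv -- "$dir/$file" "$dest/"
    done
}

# Usage:
#   list-existing-and-follow-modify "$path" | process_stream "$path" /some/other/dir
```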
"By using inotifywait, one can only monitor new files."
I would ask for a definition of a "new file". man inotifywait specifies a list of events, which includes create, delete and delete_self, and inotifywait can also watch "old files" (defined as files that existed before inotifywait was started) and directories. You specified only a single event, -e modify, which notifies about modification of files within ${path}; this covers both preexisting files and files created after inotifywait started.
"... how can one make sure that the script will catch all files?"
Your script is enough to catch all the events that happen inside the path. If you have no means of synchronization between the part that generates files and the part that receives them, there is nothing you can do and there will always be a race condition. What if your script receives 0% of CPU time and the part that generates the files gets 100%? There is no guarantee of CPU time between processes (unless you use a certified real-time system...). Implement synchronization between them.
You can watch some other event. If the generating side closes files when it is done with them, watch for the close event. You could also run work on/with NEWFILE in parallel in the background to speed up execution and the reading of new files. But if the receiving side is slower than the sending side, i.e. your script works on NEWFILEs more slowly than the generating part creates them, there is nothing you can do...
If you have no special characters and spaces in filenames, I would go with:
inotifywait -m -e modify "${path}" |
while IFS=' ' read -r path event file; do
    lock "${path}"
    # work on "${path}/${file}",
    # e.g. mv "${path}/${file}" "${new_location}"
    unlock "${path}"
done
where lock and unlock are some locking mechanism implemented between your script and the generating part. This gives you a channel of communication between the process that creates the files and the process that handles them.
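As one possible shape for lock and unlock, a lock directory can be used, since mkdir either creates the directory or fails atomically. This is only a sketch: the .lock name is an assumption, and it helps only if the generating side follows the same convention.

```shell
#!/bin/bash
# Assumed convention: whoever holds "$dir/.lock" may touch the files in "$dir".
lock() {
    until mkdir -- "$1/.lock" 2> /dev/null; do
        sleep 1            # someone else holds the lock; retry shortly
    done
}

unlock() {
    rmdir -- "$1/.lock"
}
```

If a process dies while holding the lock, the directory stays behind and has to be removed by hand.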
I think you could use a transactional file system, which would let you "lock" a directory against the other scripts until you are done working on it, but I have no experience in that field.
"I tried combining the two loops. But if files arrive quickly and in large numbers there is a chance that the files will arrive while the second loop is running."
Run the process_new_files_loop in the background before running the process_old_files_loop. It would also be good to make sure (i.e. synchronize) that inotifywait has started successfully before you continue to the loop that handles existing files, so that there is no race condition between them either.
Maybe a simple example and/or starting point would be:
work() {
local file="$1"
some work "$file"
mv "$file" "$predefined_path"
}
process_new_files_loop() {
# let's work on modified files in parallel, so that it is faster
trap 'wait' INT
inotifywait -m -e modify "${path}" |
while IFS=' ' read -r path event file ;do
work "${path}/${file}" &
done
}
process_old_files_loop() {
# maybe we should parse in parallel here too?
# maybe export -f work; find "${path}" -type f | xargs -P0 -n1 -- bash -c 'work $1' -- ?
find "${path}" -type f |
while IFS= read -r file; do
work "${file}"
done
}
process_new_files_loop &
child=$!
sleep 1
if ! ps -p "$child" >/dev/null 2>&1; then
echo "ERROR running processing-new-file-loop" >&2
exit 1
fi
process_old_files_loop
wait # wait for process_new_file_loop
If you really care about execution speed, switch to Python or C (or anything but shell). Bash is not fast. It is a shell, meant for connecting processes (passing the stdout of one to the stdin of another); parsing a stream line by line with while IFS= read -r line is extremely slow in bash and should generally be a last resort. Using xargs, e.g. xargs -P0 -n1 sh -c 'work on "$1"; mv "$1" "$path"' -- (with path exported to the environment), or parallel might be a means to speed things up, but an average Python or C program will probably be many times faster.
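For the xargs route, a sketch along these lines handles existing files in parallel. It assumes GNU xargs (for -r and -P), uses a fixed 4 parallel jobs as an arbitrary choice, and process_all is a hypothetical name whose mv stands in for the real work.

```shell
#!/bin/bash
# Hypothetical stand-in for the real per-file work: here it only moves the file.
process_all() {
    local src=$1 dest=$2
    # -print0 / -0 keep names with spaces intact; -n 1 passes one file per job.
    # Inside sh -c, $1 is the destination and $2 is the file supplied by xargs.
    find "$src" -type f -print0 |
        xargs -0 -r -n 1 -P 4 sh -c 'mv -- "$2" "$1"/' sh "$dest"
}

# Usage:
#   process_all "$path" "$predefined_path"
```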
A simpler solution is to add an ls in front of the inotifywait in a subshell, with awk to create output that looks like inotifywait.
I use this to detect and process existing and new files:
(ls ${path} | awk '{print "'${path}' EXISTS "$1}' && inotifywait -m ${path} -e close_write -e moved_to) |
while read dir action file; do
echo $action $dir $file
# DO MY PROCESSING
done
So it runs the ls, formats the output and sends it to stdout, then runs inotifywait in the same subshell, also sending its output to stdout for processing.

Tesseract OCR large number of files

I have around 135000 .TIF files (1.2KB to 1.4KB) sitting on my hard drive. I need to extract text out of those files. If I run tesseract as a cron job I get 500 to 600 per hour at most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by @Mark. I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
if [ -f /mnt/ramdisk/output/$2.txt ]
then
echo skipping $2
return
fi
tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
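The kern.argmax key is specific to macOS and the BSDs. On Linux, and on POSIX systems in general, getconf should report the same kind of limit; a small sketch (the variable name arg_max is just for illustration):

```shell
#!/bin/bash
# Upper bound, in bytes, for the combined size of arguments and environment.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max"
```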
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
    if [ -f "${2}.txt" ]; then
        echo "Skipping $1..."
        return
    fi
    tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
If you have millions of files, you could consider using multiple machines in parallel, so just make sure you have ssh logins to each of the machines on your network and then run across 4 machines, including the localhost like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.

Are glob expressions subject to caching? How can a refresh be forced?

My script downloads files from a web server in an infinite loop. My script calls wget to get the newest files (ones I haven't gotten before), then each new file needs to be processed. The problem is that after running wget, the files have been properly downloaded (based on an ls in a separate window), but sometimes my script (specifically, the line beginning for curFile in) sees them and sometimes it doesn't, which makes me think it is sometimes looking at an outdated cache.
while [ 5 -lt 10 ]; do
timestamp=$(date +%s)
wget -mbr -l0 --no-use-server-timestamps --user=username --password=password ftp://ftp.mysite.com/public_ftp/incoming/*.txt
for curFile in ftp.mysite.com/public_ftp/incoming/*.txt; do
curFileMtime=$(stat -c %W "$curFile")
if((curFileMtime > timestamp)); then
echo "$curFile"
cp "$curFile" CommLink/MDLFile
cd CommLink
SendMDLGetTab
cd ..
fi
done
sleep 120
done
The first few times through the loop this seems to work fine, then it becomes sporadic afterwards (sometimes it sees the new files and sometimes it doesn't). I've done a lot of googling, and found that bash does cache pathnames for use in running executables (so sometimes it tries to execute things that aren't there, if the executable has been recently removed) but I haven't found anything on caching non-executable filenames, which would result in it not seeing things that are there. Any ideas? If it is a caching problem, how can I force it to not look at the cache?
As the most immediate issue -- the -b argument to wget tells it to run in the background. Thus, with this flag set, the first subsequent command takes place while wget is still running.
Beyond that: Results of glob expressions -- such as ftp.mysite.com/public_ftp/incoming/*.txt -- are not cached by the shell. However, this glob is only evaluated once per loop: If a new text file isn't present at the start of that loop, it won't be picked up until the next iteration.
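That the glob is expanded afresh each time it is evaluated can be checked with a small experiment. The scratch directory and the count_txt helper below are made up for the demonstration.

```shell
#!/bin/bash
dir=$(mktemp -d)                      # scratch directory for the demonstration

# Count the .txt files the glob matches at the moment of the call.
count_txt() {
    local f n=0
    for f in "$dir"/*.txt; do
        [ -e "$f" ] && n=$((n + 1))   # skip the unexpanded pattern when nothing matches
    done
    echo "$n"
}

before=$(count_txt)
touch "$dir/new.txt"
after=$(count_txt)
echo "before=$before after=$after"    # prints: before=0 after=1
rm -rf "$dir"
```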
However, the mechanism the code in the question uses for excluding files already present before wget was run is prone to race conditions. I would suggest the following instead:
while IFS= read -r -d '' filename; do
[[ "$filename" -nt CommLink/MDLFile ]] || continue # retest in case MDLFile has changed
cp -- "$filename" CommLink/MDLFile && {
touch -r "$filename" CommLink/MDLFile # copy mtime to destination
(cd CommLink && exec SendMDLGetTab) # scope cd to subshell
}
done < <(find ftp.mysite.com/public_ftp/incoming/ \
-name '*.txt' \
-newer CommLink/MDLFile \
-print0)
Some of the finer points:
The code above compares timestamps to the current copy of MDLFile, rather than to the beginning of the current iteration of the loop. This is more robust in terms of ensuring that updates are processed if a prior invocation of this script was interrupted after the wget but before the cp.
Using touch -r ensures that the new copy of MDLFile retains the mtime of the original. (One might replace the cp with ln -f to hardlink the inode to get this same effect without any race conditions and while only storing MDLFile once on-disk, if the side effects are acceptable).
The code above only performs operations intended to be run inside a subdirectory if the cd into that subdirectory succeeded, and scopes that cd by performing operations intended to be performed in a separate directory in a subshell. (The cost of this subshell is offset by using exec when running the external command its ultimate intent is to trigger).
Using a NUL-delimited stream, as in find -print0 | while IFS= read -r -d '' filename, ensures that all possible names -- including names with literal newlines -- can be correctly handled.
Whether timestamps are stored at or beyond integer-level resolution varies by filesystem; however, bash only supports integer math. The above -- wherein no numeric comparisons are performed by the shell -- is thus somewhat more robust.

Resources