I have a directory containing hundreds of thousands of PDF files with quite complex names. I need to be able to move SOME (not all files) from the directory they're in to another directory. Here is an example of my .sh script that handles it:
#!/bin/bash
/usr/bin/echo "Moving subset 300-399"
# 300-399
/usr/bin/mv *-*-*-3[0-9][0-9]-*-*-*-*.pdf ../destination_folder/
/usr/bin/echo "Moving subset 450-499"
# 450-499
/usr/bin/mv *-*-*-4[5-9][0-9]-*-*-*-*.pdf ../destination_folder/
/usr/bin/echo "Moving subset 500-599"
# 500-599
/usr/bin/mv *-*-*-5[0-9][0-9]-*-*-*-*.pdf ../destination_folder/
Because there are so many files and I think that mv is performing an evaluation on every single one, it's taking upwards of two hours to perform the work. This is a script that must be run EVERY day, so I need to find a more efficient way to do the work. Is there a more efficient command I can utilize in a Windows environment or a more efficient way I can evaluate each file in order to speed up the mv process?
As mentioned in the comments, PowerShell will probably be faster as it is native to Windows. The difference in speed will depend on the implementation of bash you are using.
For a pure bash solution, you can try:
#!/bin/bash
find /input/folder -regextype posix-extended -regex '.*/([^/-]+-){3}(4[5-9]|[35][0-9])[0-9](-[^/-]+){4}\.pdf' -exec mv -t /destination/folder {} +
Explanation:
find /input/folder -regextype posix-extended -regex :
find every file in your input folder whose path matches the regex (find's -regex tests the whole path, not just the file name, which is why the pattern starts with .*/)
'.*/([^/-]+-){3}(4[5-9]|[35][0-9])[0-9](-[^/-]+){4}\.pdf'
the pattern matching your files: the fourth dash-separated field must be in 300-399, 450-499 or 500-599
-exec mv -t /destination/folder {} +
execute the mv command on every file found (with GNU mv, -t names the destination directory so that {} can stay right before the +)
the + symbol means the command will be executed in as few calls as possible, once find has discovered every file matching the regex
It is worth mentioning that the duration of these mv commands also depends, of course, on the amount of data: the total size of the PDF files in the current directory.
Please note that the mv command has at least 2 different behaviors with different performance characteristics, depending on the location of the ../destination_folder/ directory:
../destination_folder/ and the *.pdf files on different file systems: the mv command copies the files and then removes them from the source directory.
../destination_folder/ and the *.pdf files on the same file system: only a rename is done, which is super fast.
The df command can be used to check which file system the ../destination_folder/ directory actually lives on.
If you can choose the destination directory, make sure it is located on the same file system: expect a great improvement.
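For example, a quick check with the placeholder paths used above:
df /input/folder ../destination_folder
If both lines report the same file system / mount point, the mv boils down to a cheap rename; otherwise the data really gets copied.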
In addition, if the ../destination_folder/ directory is located on a remote server, the duration also depends on the network speed. If this is your situation, then compressing/uncompressing the files while moving is worth testing: the performance can be much better.
If you have bash on Windows, you can run each mv in the background with the & suffix and try to parallelize the work to achieve better performance. Use the wait builtin to wait for the background processes to complete. For example:
/usr/bin/echo "Moving subset 300-399"
/usr/bin/mv *-*-*-3[0-9][0-9]-*-*-*-*.pdf ../destination_folder/ & # Run this line in the background
# Other async calls
# Wait for background processes to finish
wait
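An untested sketch of the whole script along those lines (same globs and destination as in the question):
#!/bin/bash
/usr/bin/echo "Moving subsets 300-399, 450-499 and 500-599 in parallel"
/usr/bin/mv *-*-*-3[0-9][0-9]-*-*-*-*.pdf ../destination_folder/ &
/usr/bin/mv *-*-*-4[5-9][0-9]-*-*-*-*.pdf ../destination_folder/ &
/usr/bin/mv *-*-*-5[0-9][0-9]-*-*-*-*.pdf ../destination_folder/ &
wait  # block until all three background moves have finished
/usr/bin/echo "Done"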
If you want PowerShell, you can use Start-Job to run these in the background. To use your 300 subset as an example:
Write-Host "Moving subset 300-399"
$mv300 = Start-Job {
$sourceFiles = Get-ChildItem -File .\*-*-*-3*-*-*-*-*.pdf | Where-Object {
$_.FullName -match '\\(\w+-){3}3[0-9]{2}(-\w+){4}\.pdf$'
}
Move-Item -Path $sourceFiles "..\destination_folder"
}
# Here you would also start other async jobs, assigning $mv400, $mv500, etc. like above
...
# Wait for job to complete
while( $mv300.State -notin 'Completed', 'Failed' ) {
Start-Sleep 30 # Change this to number of seconds to poll job again
}
Honorable mention
A second alternative on Windows would be to use robocopy.exe, which copies and moves files faster than the standard copy and move commands. The /mt parameter will make use of multi-threading. Unfortunately, I don't have any robocopy examples to share here.
Explaining the regex
Note: I have since learned that you can use basic character ranges with Get-ChildItem and some other PowerShell cmdlets which support globbing. See my edit at the bottom of this answer for more information.
Since asked, here's a breakdown of the .NET regex I used to match on the filename:
\\(\w+-){3}3[0-9]{2}(-\w+){4}\.pdf$
\\: Literal \ character
(\w+-): Looks for group of one or more \w word-characters followed by a -
{3}: Quantifier to match on exactly 3 occurrences of the previous group
3[0-9]: Looks for literal 3 followed by a digit character
{2}: Quantifier to match exactly two occurrences of the preceding digit class, so the literal 3 is followed by exactly two digits
(-\w+): Looks for group consisting of a literal - followed by one or more word-characters \w
{4}: Quantifier to match exactly 4 occurrences of the previous group
\.pdf: Literal . character followed by pdf
$: End of input/string
At the time of writing I was unaware that character ranges can be used with globbing in Get-ChildItem, so I resorted to using a regular expression to find the exact number of fields matching the specific number pattern in the 4th field, while ensuring the 8-field filename was intact for any found files.
If you plug this expression into https://regexr.com, it will break the expression down and explain everything better visually than I can here, without making this answer too long.
EDIT
As I learned the other day, you can use character ranges with PowerShell's file matching, though this doesn't work in other contexts within Windows. In my example above the following line can be modified to match letter and number ranges as well without having to use regex. If you take the following code from above:
$sourceFiles = Get-ChildItem -File .\*-*-*-3*-*-*-*-*.pdf | Where-Object {
$_.FullName -match '\\(\w+-){3}3[0-9]{2}(-\w+){4}\.pdf$'
}
we can use globbing to match on the filename without having to use the Where-Object or regular expression, greatly reducing the complexity of this bit:
$sourceFiles = Get-ChildItem -File .\*-*-*-3[0-9][0-9]-*-*-*-*.pdf
Here is the modified code for eschewing the regex in favor of globbing:
Write-Host "Moving subset 300-399"
$mv300 = Start-Job {
$sourceFiles = Get-ChildItem -File .\*-*-*-3[0-9][0-9]-*-*-*-*.pdf
Move-Item -Path $sourceFiles "..\destination_folder"
}
# Here you would also start other async jobs, assigning $mv400, $mv500, etc. like above
...
# Wait for job to complete
while( $mv300.State -notin 'Completed', 'Failed' ) {
Start-Sleep 30 # Change this to number of seconds to poll job again
}
The availability of this feature seems to hinge on whether a PowerShell construct is performing the globbing (it works) or whether the globbing is left to the native Win32 API (it does not work). In other words, it appears to be supported by PowerShell itself but not by other Windows APIs.
Related
Why does assigning command output work in some cases and seemingly not in others? I created a minimal script to show what I mean, and I run it in a directory with one other file in it, a.txt. Please see the ??? in the script below and let me know what's wrong, perhaps try it. Thanks.
#!/bin/bash
## setup so anyone can copy/paste/run this script ("complete" part of MCVE)
tempdir=$(mktemp -d "${TMPDIR:-/tmp}"/demo.XXXX) || exit # make a temporary directory
trap 'rm -rf "$tempdir"' 0 # delete temporary directory on exit
cd "$tempdir" || exit # don't risk changing non-temporary directories
touch a.txt # create a sample file
cmd1="find . -name 'a*' -print"
eval $cmd1 # this produces "./a.txt" as expected
res1=$($cmd1)
echo "res1=$res1" # ??? THIS PRODUCES ONLY "res1=" , $res1 is blank ???
# let's try this as a comparison
cmd2="ls a*"
res2=$($cmd2)
echo "res2=$res2" # this produces "res2=a.txt"
Let's look at exactly what this does:
cmd1="find . -name 'a*' -print"
res1=$($cmd1)
echo "res1=$res1" # ??? THIS PRODUCES ONLY "res1=" , $res1 is blank ???
As per BashFAQ #50, execution of res1=$($cmd1) does the following, assuming you have no files with names starting with 'a and ending with ' (yes, with single quotes as part of the name), and that you haven't enabled the nullglob shell option:
res1=$( find . -name "'a*'" -print )
Note the quoting around the name? That quoting indicates that the 's are treated as data, rather than syntax; thus, rather than having any effect on whether the * is expanded, they're simply additional characters required to be part of any filename for it to match, which is why you get a result with no matches at all. Instead, as the FAQ tells you, use a function:
cmd1() {
find . -name 'a*' -print
}
res1=$(cmd1)
...or an array:
cmd1=( find . -name 'a*' -print )
res1=$( "${cmd1[#]}" )
Now, why does this happen? Read the FAQ for a full explanation. In short: Parameter expansion happens after syntactic quotes have already been applied. This is actually a Very Good Thing from a security perspective -- if all expansions recursively ran through full parsing, it would be impossible to write secure code in bash handling hostile data.
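A tiny illustration of that ordering, reusing the temporary directory from the script above: by the time expansion happens, the embedded single quotes are plain data, so they end up inside the pattern handed to find.
args="-name 'a*'"
find . $args -print   # runs: find . -name "'a*'" -print  -> no output, because
                      # no filename starts with 'a and ends with '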
Now, if you don't care about security, and you also don't care about best practices, and you also don't care about being able to correctly interpret results with unusual filenames:
cmd1="find . -name 'a*' -print"
res1=$(eval "$cmd1") # Force parsing process to restart from beginning. DANGEROUS if cmd1
# is not static (ie. constructed with user input or filenames);
# prone to being used for shell injection attacks.
echo "res1=$res1"
...but don't do that. (One can get away with sloppy practices only until one can't, and the point when one can't can be unpleasant; for the sysadmin staff at one of my former jobs, that point came when a backup-maintenance script deleted several TB worth of billing data because a buffer overflow had placed random garbage in the name of a file that was due to be deleted). Read the FAQ, follow the practices it contains.
Ok, to be clear, this is a school assignment, and I don't need the entire code. The problem is this: I use
set subory = ("$subory:q" `sh -c "find '$cesta' -type f 2> /dev/null"`)
to fill the variable subory with all ordinary files in a specified path. Then I have a foreach where I count the lines of all files in a directory; that's not the problem. The problem is that when this script is tested, some big directories are used as the path. What happens is that the script doesn't finish, but gives the error message word too long. That word is subory. This is a real problem, because $cesta can be an element of a long list of paths. I tried, but I cannot solve this problem. Any ideas? I'm a bit lost.
EDIT: To be clear, the task is to assign each directory a number that represents the total line count of all its files, and then pick the directory with the greatest number.
You need to reorganize your code. For example:
find "$cesta" -type f -execdir wc -l {} +
This will run wc on all the files found, without ever running afoul of command line-length limitations, "invalid" characters like newlines in filenames, etc. And you don't need to spawn a new shell to do it.
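If the goal is the one described in the edit (a total line count per directory, then the directory with the largest total), here is a rough bash sketch (not the csh the assignment apparently requires), assuming the candidate directories are passed as arguments:
#!/bin/bash
best_dir= best_count=-1
for dir in "$@"; do
    # concatenate every regular file under $dir and count the lines;
    # find batches the arguments itself, so no "word too long" problem
    count=$(find "$dir" -type f -exec cat {} + 2>/dev/null | wc -l)
    if [ "$count" -gt "$best_count" ]; then
        best_count=$count
        best_dir=$dir
    fi
done
echo "$best_dir: $best_count lines"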
It's my first time using Bash scripting and I've been looking at some tutorials, but I can't figure out some of the code. I just want to list all the files in a folder, but I can't do it.
Here's my code so far.
#!/bin/bash
# My first script
echo "Printing files..."
FILES="/Bash/sample/*"
for f in $FILES
do
echo "this is $f"
done
and here is my output:
Printing files...
this is /Bash/sample/*
What is wrong with my code?
You misunderstood what bash means by the word "in". The statement for f in $FILES simply iterates over (space-delimited) words in the string $FILES, whose value is "/Bash/sample/*" (one word). You seemingly want the files that are "in" the named directory, a spatial metaphor that bash's syntax doesn't assume, so you would have to explicitly tell it to list the files.
for f in `ls $FILES` # illustrates the problem - but don't actually do this (see below)
...
might do it. This converts the output of the ls command into a string, "in" which there will be one word per file.
NB: this example is to help understand what "in" means but is not a good general solution. It will run into trouble as soon as one of the files has a space in its name; such files will contribute two or more words to the list, each of which taken alone may not be a valid filename. This highlights (a) that you should always take extra steps to program around the whitespace problem in bash and similar shells, and (b) that you should avoid spaces in your own file and directory names, because you'll come across plenty of otherwise useful third-party scripts and utilities that have not made the effort to comply with (a). Unfortunately, proper compliance can often lead to quite obfuscated syntax in bash.
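For completeness, a small sketch of the compliant approach hinted at above: let the glob expand directly in the for statement (no ls, no intermediate string) and quote the variable wherever it is used, so names containing spaces survive intact:
for f in /Bash/sample/*
do
    echo "this is $f"
done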
I think the problem is in the path "/Bash/sample/*".
You need to change this location to an absolute path, for example:
/home/username/Bash/sample/*
Or use a path relative to your home directory, for example:
~/Bash/sample/*
On most systems this is fully equivalent to:
/home/username/Bash/sample/*
where username is your current username; use whoami to see your current username.
Best place for learning Bash: http://www.tldp.org/LDP/abs/html/index.html
This should work:
echo "Printing files..."
FILES=(/Bash/sample/*) # create an array.
# Works with filenames containing spaces.
# String variable does not work for that case.
for f in "${FILES[#]}" # iterate over the array.
do
echo "this is $f"
done
& you should not parse ls output.
Take a list of your files
If you want to take a list of your files and see them:
ls        # takes the list
ls -sh    # takes the list plus file sizes
...
If you want to send the list of files to a file, so you can read and check them later:
ls > FileName.Format        # takes the list and sends it to a file
ls -sh > FileName.Format    # takes the list with file sizes and sends it to a file
I have a programme that is generating files like "Incoming11781Arp": the name always starts with Incoming, it is always followed by 5 digits, and then by 3 characters that can be upper-case letters, lower-case letters, digits, or the underscore _, in any combination. Like Incoming11781_pi, or Incoming11781rKD.
How can I delete them using a script run from a cron job please? I've tried -
#!/bin/bash
file=~/Mail/Incoming******
rm "$file";
but it failed saying that there was no matching file or directory.
You mustn't double-quote the variable reference for pathname expansion to occur - if you do, the wildcard characters are treated as literals.
Thus:
rm $file
Caveat: ~/Mail/Incoming****** doesn't work the way you think it does and will potentially match more files than intended, as it is equivalent to ~/Mail/Incoming*, meaning that any file that starts with Incoming will match.
To only match files starting with Incoming that are followed by exactly 6 characters, use ~/Mail/Incoming??????, as @Jidder suggests in a comment.
Note that you could make your glob (pattern) even more specific:
file=~/Mail/Incoming[0-9][0-9][0-9][0-9][0-9][[:alpha:]_][[:alpha:]_][[:alpha:]_]
See the bash manual for a description of pathname expansion and pattern syntax: http://www.gnu.org/software/bash/manual/bashref.html#index-pathname-expansion.
You can achieve the same effect with the find command...
$ directory=~/Mail/
$ file_pattern='Incoming*'
$ find "${directory}" -name "${file_pattern}" -delete
The first two lines define the directory and the file pattern separately; the find command will then proceed to delete any matching files inside that directory. Note that the tilde must stay unquoted in the assignment so that it expands to your home directory.
I have a shell script which uses some * wildcards. For example:
mv /someplace/*.DAT /someotherplace
And
for file in /someplace/*.DAT
do
echo $file
done
Then when I think about error handling, I worry about the infamous "argument list too long" error.
How much should I worry about it? Actually, how much can the shell hold? For example, will it die at 500 files or 1000 files? Does it depend on the length of the filenames?
EDIT:
I have found out that the argument maximum is 131072 bytes. I am not looking for a solution to overcome the "argument list too long" problem. What I really want to know is: how does that translate to a normal command string? i.e. how "long" can the command be? Does it count spaces?
Pardon my ignorance.
If I remember correctly, it is capped at 32 KB of data.
first command
find /someplace -name '*.DAT' -print0 | xargs -r0 mv --target-directory='/someotherplace'
second command
find /someplace -type f -name "*.DAT"
Yes, it depends on filename length. The command line maximum is a single hardcoded limit, so long filenames will exhaust it faster. And it's usually a kernel limitation, so there is no way around it within bash. And yes, this is serious: errors that occur only infrequently are always more serious than obvious errors, because quality assurance will probably miss them, and when they do happen it is almost guaranteed to be with a nightmarish unreadable command line that you can't even reconstruct properly!
For all these reasons: deal with the problem now rather than later.
Whether
How much should you worry about it? You may as well ask "What is the lifespan of my code?"
I would urge you to always worry about the argument list limit. This limit is set at compile time and can easily be different on different systems, shells, etc. Do you know for sure that your code will always run in its original environment with expected input and that environment's original limit?
If the expansion of a glob could result in an unknown number of files or files with an unknown length being expanded or that expansion could exceed the limit that will be in effect in any unknown future environment then you should write your code from day one so as to avoid this bug.
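If you want to see the limit on the machine where your code actually runs, getconf reports it (the value differs between systems and covers the environment as well as the arguments):
getconf ARG_MAX   # maximum combined size of arguments + environment, in bytes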
How
There are three find-based solutions to this problem. The classic solution uses xargs
find ... | xargs command
xargs will execute command with as many matches as it can without overflowing the argument list, then repeat that invocation as necessary until there are no more results from find.
This solution is problematic because file names may contain newlines. If you're lucky, you have a nicer version of find that supports null-terminating file names with -print0, and you can use the safer solution
find ... -print0 | xargs -0 command
This is the same as the first find except it's safe for all legal file names.
Newer versions of find may support -exec with the + terminator, which allows for another solution
find ... -exec command {} +
This is functionally identical to the second find command above: safe for all file names, splits invocations of command into chunks that won't overflow the argument list. I prefer this form, when available.
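Applied to the mv example from the question, a sketch using that last form (assuming GNU find and GNU mv, since -maxdepth and -t are not POSIX):
find /someplace -maxdepth 1 -type f -name '*.DAT' -exec mv -t /someotherplace {} +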