Faster iteration over thousands of files - bash

I'm trying to do something on ~200,000 files in a single folder.
When I do this:
for i in *; do /bin/echo -n "."; done
One dot is printed every few seconds, while the same operation on a folder with a hundred files runs blazingly fast.
Why is this, and how can I speed things up for folders with thousands of files?

The loop forks a new /bin/echo process for every single file, whereas find does the traversal and the printing in one process. Try this with GNU find:
find . -maxdepth 1 -type f -printf "."
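If the dots are only there to gauge how many files you have, the same approach can produce a count directly; a minimal sketch (GNU find assumed):
find . -maxdepth 1 -type f -printf "." | wc -c
# or count names instead of dots (fine as long as no filename contains a newline)
find . -maxdepth 1 -type f | wc -l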

Related

How to count files only in specific subdirectories located deep in the hierarchy?

I need to count all session files sess_* located in TMP directories (on a Debian machine), and to know the path to each TMP directory along with the count for each one.
All parent directories are under /somepath/to/clientsDirs.
The directory structure for one client is
../ClientDirX/webDirYX/someDirZx
../ClientDirX/webDirYX/someDirZy
../ClientDirX/webDirYX/tmp
../ClientDirX/webDirYX/someDirZz
../ClientDirX/webDirYX/...
../ClientDirX/webDirYX/someDirZN
../ClientDirX/webDirYY/someDirZx
../ClientDirX/webDirYY/someDirZy
../ClientDirX/webDirYY/tmp
../ClientDirX/webDirYY/someDirZz
../ClientDirX/webDirYY/...
../ClientDirX/webDirYY/someDirZN
All someDirZ and tmp directories have varying numbers of subdirectories. Session files are in the tmp directory only, not in its subdirectories. A single tmp directory can hold millions of sess_* files, so the solution needs to be very time-efficient.
The X, YY, etc. in the directory names are always numbers, but they are not consecutive, e.g.:
ClientDir1/webDir3/*
ClientDir4/webDir31/*
ClientDir4/webDir35/*
ClientDir18/webDir2/*
Could you please help me count all sess_* files in each tmp dir, using the command line or a bash script?
EDIT: answer revised after the question was reworded.
The whole task is divided into three parts.
I have changed the directory names to simpler ones.
1. Build a list of tmp directories to search (first script)
#!/bin/bash
find /var/log/clients/sd*/wd*/ -maxdepth 1 -type d -name "tmp" >list
explanation
-type d only searches for directories
-maxdepth 1 specifies the maximum search depth
-name specifies the name of the items sought
>list redirects the result to the file named list
* is so-called shell globbing; in this case it matches any string of characters
We write the result to a separate file for two reasons. First, the execution time can be significant. Second, the list of clients does not change very often, so there is no point in rebuilding it on every run.
2. Iterate over the list items in bash (see the final script)
3. Search for sess_* files in a tmp directory without descending into its subdirectories
find /path/to/tmp -maxdepth 1 -type f -name "sess_*" -exec printf "1" \; | wc -c
explanation
-type f only searches for regular files
-exec executes an arbitrary command, in this case printf
\; is the required terminator of the -exec command and must be preceded by a space!
-exec printf is used because not every version of find has printf built in, so this also works on BusyBox and outside the GNU world
If your find has -printf, use it instead of -exec (-printf "1"), as shown in the example below
For more, see man find
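With GNU find, the command from step 3 then becomes:
find /path/to/tmp -maxdepth 1 -type f -name "sess_*" -printf "1" | wc -c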
Finally the second script:
#!/bin/bash
for x in `cat list`
do
printf "%s \t" $x
find $x -maxdepth 1 -type f -name "sess_*" -exec printf "1" \; | wc -c
done
Example result:
/var/log/clients/sd1/wd1/tmp 3
/var/log/clients/sd2/wd1/tmp 62
EDIT:
Note: in some versions of GNU find (e.g. 4.7.0-git), changing the order of -maxdepth 1 and -type f makes the program print a warning or not work as expected. It seems that these versions do not use the getopt mechanism for some reason. Other versions of find do not appear to have this problem.

find seems to be much slower with the -print0 option

I am trying to resize photos larger than specific dimensions, across hundreds of thousands of photos collected by a system over the past 10 years. I am using find and ImageMagick.
I wrote this script to do it.
#!/bin/bash
ResizeSize="1080^>"
Processing=0
find . -type f -iname '*JPG' -print0 | \
while IFS= read -r -d '' image; do
    ((Processing++))
    echo "Processing file: $Processing"
    echo "Resizing $image"
    convert "$image" -resize "$ResizeSize" "${image}___"
    if [ $? -eq 0 ] ; then
        rm "$image"
        if [ $? -eq 0 ] ; then
            mv "${image}___" "$image"
        fi
    else
        echo "something wrong with resize"
        exit 1
    fi
done
The script works on a small number of files, but it takes a long time to start when there are lots of files. I have tested on the command line find . -type f -iname '*JPG' -print0 vs find . -type f -iname '*JPG'. The latter finds files within a few seconds, but the former takes minutes before anything is found. Unfortunately the -print0 is required for dealing with filenames containing special characters (which are mainly spaces in my case). How can I make this script more efficient?
I cannot reproduce the behavior you're experiencing, but I can think of two possible explanations.
First, you might be experiencing positive effects of page (disk) caching.
When you call find for the first time, it traverses files (reading metadata from inodes), actually reading from the storage media (HDD) via kernel syscalls. But the kernel (transparently to find and other applications) also stores that data in unused areas of memory, which act as a cache. If this data is read again later, it can be served quickly from this in-memory cache. This is called page caching.
So, your second call to find (no matter what output separator is used) will be a lot faster, assuming you are searching over the same files, with the same criteria.
Second, find's output might be buffered: if your files are in many different locations, it might take some time before the first actual output reaches the while command. Also, if the output is line-buffered, that would explain why the -print0 variant takes longer to produce its first output (since there are no lines at all).
You can try running find with unbuffered output, via stdbuf command:
stdbuf -o0 find . -iname '*.jpg' -type f -print0 ...
One more thing, unrelated to this: to speed up your find search, you might want to consider calling it like this:
find . -iname '*.jpg' -type f -print0
Here we put the -iname test before the -type test in order to avoid having to call stat(2) on every file. Even better would be to remove the -type test altogether, if possible.
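Putting both suggestions together, the pipeline feeding the script's while loop might look like this (a sketch; stdbuf is part of GNU coreutils and only changes buffering, not the results):
stdbuf -o0 find . -iname '*.jpg' -type f -print0 | \
while IFS= read -r -d '' image; do
    echo "Resizing $image"
    # ... convert/rm/mv steps as in the original script ...
done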

Bash: How to control iteration flow/loops?

For going over some recovered data, I am working on a script that recursively goes through folders and files and finally runs file on them, to check whether they were likely fully recovered from a certain backup or not (recovered files play, and are identified as MP3 or other audio; non-working files as ASCII text).
For now I would be satisfied with having it go over my test folder structure and print all folders and their corresponding files (printing them mainly for testing, but also because I would like to log where the script currently is, and in the end how far it got, to verify what has been processed).
I tried using two for loops, one for the folders and then one for the files, so that ideally it would take one folder, list the files in it (or potentially descend into subfolders), print below each folder only the files in that folder, and then move on to the next.
Such as:
Folder1
- File 1
- File 2
-- Subfolder
-- File3
-- File4
Folder2
- File5
However, this doesn't seem to work in the ways (such as with for loops) that are normally proposed. I got as far as using "find . -type d" for the directories and "find . -type f" or "find * -type f" (so that it doesn't go into subdirectories). However, when just printing the paths/files to check whether it ran as I wanted, it became obvious that it didn't work.
It always seemed to print all the directories first (first loop) and then all the files (second loop). To keep track of what it is doing, and to make it easier to know what was checked/recovered, I would like to do this in the more orderly fashion explained above.
So is it that I just did something wrong, or is this maybe a general limitation of the for loop in bash?
Another problem that could be related: Although assigning the output of find to an array seemed to work, it wasn't accessible as an array ...
Example for loop:
for folder in '$(find . -type d)' ; do
echo $folder
let foldercounter++
done
Arrays:
folders=("$(find . -type d)")
#As far as I know this should assign the output as an array
#However, it is not really assigned properly somehow as
echo "$folders[1]"
# does not work (quotes necessary for spaces)
A find ... -exec ... solution like the one H.-Dirk Schmitt was referring to might look something like:
find . -type f -exec sh -c '
    case $(file "$1") in
        *Audio file*)
            echo "$1 is an audio file"
            ;;
        *ASCII text*)
            echo "$1 is an ascii text file"
            ;;
    esac
' _ {} ';'
If you want to run file on every file and directory in the current directory, including its subdirectories and so on, you don't need to use a Bash for-loop, because you can just tell find to run file:
find -exec file '{}' ';'
(The -exec ... ';' option runs the command ... on every matched file or directory, replacing the argument {} with the path to the file.)
If you only want to run file on regular files (not directories), you can specify -type f:
find -type f -exec file '{}' ';'
If you (say) want to just print the names of directories, but run the above on regular files, you can use the -or operator to connect one directive that uses -type d and one that uses -type f:
find -type d -print -or -type f -exec file '{}' ';'
Edited to add: If desired, the effect of the above commands can be achieved in pure Bash (plus the file command, of course), by writing a recursive shell function. For example:
function foo () {
    local file
    for file in "$1"/* ; do
        if [[ -d "$file" ]] ; then
            echo "$file"
            foo "$file"
        else
            file "$file"
        fi
    done
}
foo .
This differs from the find command in that it will sort the files more consistently, and perhaps in gritty details such as handling of dot-files and symbolic links, but is broadly the same, so may be used as a starting-point for further adjustments.
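As for the array problem mentioned in the question: folders=("$(find . -type d)") stores the entire output as a single array element, and "$folders[1]" expands only the first element followed by a literal [1]. A sketch of how to actually fill a bash array from find output (mapfile -d '' requires bash 4.4 or later):
# Read NUL-separated paths into a real array; this survives spaces and newlines in names.
mapfile -d '' folders < <(find . -type d -print0)
echo "${folders[1]}"     # element access needs the braces
echo "${#folders[@]}"    # number of directories found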

Moving files that have no extensions in a number of directories into one directory

I have a large collection of ~1,000 files (without extensions, e.g. 1105, 1106, 5231, etc.) spread across a corresponding number of folders, i.e. a thousand files like so:
/users/me/collection/1105/1455, /users/me/collection/1106/1466, /users/me/collection/1110/1470, etc.
What I want is a quick way to move all of these files from the subdirectories (i.e. 1455, 1466, 1470, etc.) into one single directory (i.e. /users/me/collection-all/).
To be honest, the lack of an extension is throwing me off, and I keep finding directories alongside the files... They are actually all PDFs, just without an extension.
In fact the answer is very simple:
you can find them and exclude the directories:
cp `find <your directory tree base> ! -type d` <your destination directory>
The "! -type d" will naturally exclude the results of type "directory".
HTH
How about this?
mv /users/me/collection/*/* /users/me/collection-all/
My two cents
cd /users/me/collection
find . -type f -exec mv {} /users/me/collection-all/ \;
You can try this:
#!/bin/bash
IFS_BACKUP=$IFS
IFS=$'\n\t'
for i in $(find "$source_directory" -type f -print)
do
mv -nv "$i" "$target_directory"
done
IFS=$IFS_BACKUP
exit
Just make sure to replace the variables $source_directory and $target_directory with the appropriate paths.
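For completeness, a variant that copes with any filename, including names with embedded newlines (a sketch; the -n and -t options assume GNU mv):
# NUL-delimited paths are never split; -n avoids overwriting files that happen to share a name.
find /users/me/collection -type f -print0 | xargs -0 mv -n -t /users/me/collection-all/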

Concatenating Thousands of Text Files Across Hundreds of Directories (while keeping some structure)

I have a set of plain text files spread across 400+ directories, with tons and tons of subdirectories. There are about 300,000 text files. For example:
directory1/subdirectory1
directory1/subdirectory2
directory1/subdirectory1/subdirectory3
All of those text files within directory1 should end up in one big massive text file named directory1.txt. Then repeat with directory2.
What would be the quickest and simplest way to go into each of these four hundred directories and combine all of the text files in such a manner?
I know I could go into each of the four hundred directories, use commands such as find to bring all the text files together into one directory, and then use cat *.txt >> all.txt, but surely there must be an easy way to automate this process?
To concatenate all txt-files in a subtree:
#!/bin/sh
# Usage: cat-txt dirname
find "$1" -name \*.txt -print0 | xargs -0 cat >> "$1.txt"
Call cat-txt on all immediate subdirectories:
$ find -mindepth 1 -maxdepth 1 -type d -exec cat-txt '{}' \;
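If you'd rather not keep a helper script around, the same thing can be done inline (a sketch; it spawns one shell per directory):
find . -mindepth 1 -maxdepth 1 -type d -exec sh -c 'find "$1" -name "*.txt" -print0 | xargs -0 cat >> "$1.txt"' _ {} \;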
I've tested this on my system and it works flawlessly. You may want to tweak it to your parameters, but this one-liner does everything you need.
for I in `ls -d */`; do cat "$I"* > "${I%/}.txt"; done
You may want to restrict it to text files, otherwise you'll pick up binary data as well. Enjoy.
for I in `ls -d */`; do cat "$I"*.txt > "${I%/}.txt"; done
there must be an easy way to automate this process?
Why are you looking for one? Is this a one-time activity, or are you going to repeat it every now and then?
I would just say go with something as simple as:
for ff in `find . -mindepth 1 -maxdepth 1 -type d`
do
    find "$ff" -type f -name '*.txt' -exec cat {} \; >> "$ff.txt"
done

Resources