find directories but exclude list where directories have a space in name - bash

I have a process that audits files from one day to the next on a large file system. I want to exclude some directories from consideration by using a list of directories to exclude. I can do that just fine, but I'm having trouble if an exclude directory has a space in the name.
For simplicity's sake, I'm only going to list four sub-directories, but in reality there are many more directories I want to search vs exclude. There's also the chance that a new directory gets added and I want to automatically include new directories, hence the exclude list vs using an include list.
base_dir/
├── sub_dir1
├── sub_dir2
├── sub dir3
└── sub_dir4
I have a shell script and an exclude list
$ cat exclude.txt
sub_dir2
sub dir3
The shell script uses find and printf along with awk and sort to get a list of directories to audit.
$ find ./base_dir -maxdepth 1 -type d $(printf "! -iname %s " $(cat exclude.txt)) | awk -F/ '{print $NF}' | sort
sub_dir1
sub dir3
sub_dir4
As you can probably guess and see above, this works except that it's not ignoring sub dir3. I've tried a few combinations of double quotes inside exclude list and using %q vs %s vs %a, but can't seem to find the correct combination.
My desired output is
sub_dir1
sub_dir4
I realize I could do something like:
find ./base_dir -maxdepth 1 -type d \
! -iname "sub dir3" $(printf "! -iname %s " $(cat exclude.txt)) \
| awk -F/ '{print $NF}' | sort
and get my expected output, but I want to only use the exclude.txt list.
EDIT
After reading some replies I tried using an array and thought that would work; now it's even less clear to me why this option fails. printf appears to produce a string that would work if I typed it directly on the command line, but running it as a one-liner still gives me errors.
$ cat exclude.txt
base_dir
sub_dir2
"sub dir3"
$ mapfile -t exclude < exclude.txt
$ printf "! -iname %s " "${exclude[@]}"
! -iname base_dir ! -iname sub_dir2 ! -iname "sub dir3"
$ find ./base_dir -maxdepth 1 -type d $(printf "! -iname %s " "${exclude[@]}")
find: paths must precede expression: dir3"
$ find ./base_dir -maxdepth 1 -type d ! -iname base_dir ! -iname sub_dir2 ! -iname "sub dir3"
./base_dir/sub_dir1
./base_dir/sub_dir4

You could read the exclude file into a Bash array and then craft a find command like this:
mapfile -t exclude < exclude.txt
find ./base_dir \
-mindepth 1 \
-type d \
-regextype egrep \
! -iregex ".*/($(IFS='|'; echo "${exclude[*]}"))" \
-printf '%f\n'
where -mindepth 1 excludes the starting directory itself, -regextype egrep makes sure the alternation "|" does not have to be escaped, and -printf '%f\n' prints just the filename without leading directories,
resulting in
sub_dir1
sub_dir4
For your example input, the -iregex test expands like this:
$ IFS='|'
$ echo "${exclude[*]}")
sub_dir2|sub dir3
so the regular expression for paths to exclude becomes
.*/(sub_dir2|sub dir3)
The change to IFS is limited to the command substitution.
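To convince yourself of that, here is a quick check (illustrative only):
exclude=(sub_dir2 "sub dir3")
joined=$(IFS='|'; echo "${exclude[*]}")
echo "$joined"        # sub_dir2|sub dir3
printf '%q\n' "$IFS"  # still the default $' \t\n'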
The limitation of this approach is that if the directories to be excluded contain characters that are special in regexes, you have to escape them, which can get messy. If you wanted to escape, for example, pipes, you could use
echo "${exclude[*]//|/\\|}"
in the command substitution, resulting in
sub_dir2|sub dir3|has\|pipe
where the directory has|pipe with a | in its name has its pipe properly escaped.
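If you want to escape every ERE metacharacter rather than just pipes, a sed pass over each entry works. The following is a sketch (it assumes GNU find and sed, and is not part of the original answer):
mapfile -t exclude < exclude.txt
escaped=()
for e in "${exclude[@]}"; do
    # backslash-escape every character that is special in POSIX EREs
    escaped+=("$(printf '%s' "$e" | sed -E 's/[][(){}.*+?|^$\\]/\\&/g')")
done
find ./base_dir -mindepth 1 -type d -regextype egrep \
    ! -iregex ".*/($(IFS='|'; echo "${escaped[*]}"))" -printf '%f\n'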

edited to include new info, in case it's useful later
Don't embed printf/cat in the command. The shell's parser is working against you: quotes read from a file arrive as data, not as syntax.
Stack the exclusion filters with paste -s into a tempfile to build your command dynamically, then execute it.
$: find ./base_dir
./base_dir
./base_dir/sub dir1
./base_dir/sub dir3
./base_dir/sub_dir1
./base_dir/sub_dir3
$: tmpfile=/tmp/xFinder
$: printf "find ./base_dir -maxdepth 1 -type d ! -iname base_dir " > $tmpfile
$: { sed -E 's/^(.*)/! -iname \"\1\"/' exclude.txt;
printf " | xargs -I R basename R "; } | paste -s >> $tmpfile
$: cat $tmpfile
find ./base_dir -maxdepth 1 -type d ! -iname base_dir ! -iname "sub_dir1" ! -iname "sub dir3" ! -iname "sub_dir4" | xargs -I R basename R
The xargs call to basename strips the path info, and ! -iname base_dir keeps the base directory out of the find output as a directory of its own.
$: . $tmpfile
sub dir1
sub_dir3
Apologies for the earlier incomplete version.
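For what it's worth, the quoting problem can also be avoided without a tempfile by handing find an array of test arguments; exclude.txt then never needs any quotes at all. A sketch (Bash, with GNU find for -printf):
mapfile -t exclude < exclude.txt
tests=()
for name in "${exclude[@]}"; do
    tests+=(! -iname "$name")   # each name stays a single argument, spaces included
done
find ./base_dir -mindepth 1 -maxdepth 1 -type d "${tests[@]}" -printf '%f\n'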

Since you only want to look at a single directory level, without recursion, you can use a for loop with wildcards:
$ find base_dir/
base_dir/
base_dir/sub_dir2
base_dir/sub_dir1
base_dir/sub_dir4
base_dir/sub dir3
$ cat exclude.txt
sub_dir2
sub dir3
$ cat script.sh
#!/bin/bash
for dir in base_dir/*
do
! [ -d "$dir" ] ||
grep -qFx -- "$(basename -- "$dir")" exclude.txt &&
continue
echo "$dir" # or do somthing else
done
$ ./script.sh
base_dir/sub_dir1
base_dir/sub_dir4

Related

How to get file count and names in directory on bash

I want to get the file count & file names & folder names in a directory:
mkdir -p /tmp/test/folder1
mkdir -p /tmp/test/folder2
touch /tmp/test/file1
touch /tmp/test/file2
file_names=$(find "/tmp/test" -mindepth 1 -maxdepth 1 -type f -print0 | xargs -0 -I {} basename "{}")
echo $file_names
here is the output:
file2 file1
For folder:
folder_names=$(find "/tmp/test" -mindepth 1 -maxdepth 1 -type d -print0 | xargs -0 -I {} basename "{}")
echo $folder_names
here is the output:
folder2 folder1
For count:
file_count=0 && $(find "/tmp/test" -mindepth 1 -maxdepth 1 -type f -print0 | let "file_count=file_count+1")
echo $file_count
folder_count=0 && $(find "/tmp/test" -mindepth 1 -maxdepth 1 -type d -print0 | let "folder_count=folder_count+1")
echo $folder_count
The file_count and folder_count do not work.
Question 1:
How to get the correct file_count and folder_count?
Question 2:
Is it possible to get the names into an array and check the count from the array size?
The answer to the second question is really the answer to the first, too.
mapfile -d '' files < <( find /tmp/test \
-mindepth 1 -maxdepth 1 \
-type f -printf '%f\0')
echo "${#files[@]} files"
printf '%s\n' "${files[@]}"
The use of double quotes and @ in the array expansion is essential for printing file names with whitespace correctly. The use of a null byte terminator between file names ensures that even newlines in file names are disambiguated.
Notice also the use of -printf with a specific format string to avoid having to run basename separately. However, the -printf option and its various format strings, as well as the -print0 option you used, are a GNU find extension, and thus not portable. (Linux typically ships with GNU tools; on other platforms, they are obviously easy to install, but typically not installed out of the box.)
If you have an older version of Bash which doesn't support mapfile, try an explicit loop:
files=()
while IFS= read -r -d $'\0' file; do
files+=("$file")
done < <(find ...)
If you don't have GNU find, a common workaround is to print a fixed string for each found file, and then the line or character count reliably reflects the number of found files.
find /tmp/test -type f \
-mindepth 1 -maxdepth 1 \
-exec printf . \; |
wc -c
Though then, how do you collect the file names? If (as in your case) you don't require recursion into subdirectories, simply loop over all items in the directory.
In which case, again, the number of items in the collected array will also tell you how many there are.
files=()
dirs=()
for item in /tmp/test/*; do
if [[ -f "$item"]]; then
files+=("$item")
elif [[ -d "$item" ]]; then
dirs+=("$item")
fi
done
echo "${#dirs[#] directories}
printf '- %s\n' "${dirs[#]}"
echo "${#files[#]} files"
printf '%s\n' "${dirs[#]}"
For a further discussion, see also https://mywiki.wooledge.org/BashFAQ/020
Needlessly collecting items into an array so you can loop over it once is a common beginner antipattern. If you just want to process each item in isolation and then forget it, there is no need to waste memory on remembering all the other items - just loop over them directly.
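For example, instead of collecting into files and looping, something like this processes and counts in a single pass (a sketch along the same lines):
count=0
for item in /tmp/test/*; do
    [[ -f $item ]] || continue          # only regular files
    printf 'processing %s\n' "$item"    # the real per-file work goes here
    ((++count))
done
echo "$count files"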
As an aside, running find in a command substitution creates a subshell with its own set of variables; thus in your attempt, the let would increment the subshell's copy of the counter from 0 to 1 and the result would be discarded when the subshell exits (and piping find's output to let does nothing useful anyway, since let does not read standard input).
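You can watch the same effect in isolation (in Bash without shopt -s lastpipe, the while loop below runs in a subshell, just like the attempted let):
count=0
printf 'a\nb\nc\n' | while read -r _; do ((count++)); done
echo "$count"   # prints 0, not 3: the increments happened in the subshell and were lost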

Show directory path with only files present in them

This is the folder structure that I have.
Using the find command find . -type d in the root folder gives me the following result:
Result
./folder1
./folder1/folder2
./folder1/folder2/folder3
However, I want the result to be only ./folder1/folder2/folder3, i.e. only print the result if there's a file of type .txt present inside.
Can someone help with this scenario? Hope it makes sense.
find . -type f -name '*.txt' |
sed 's=/[^/]*\.txt$==' |
sort -u
Find all .txt files, remove file names with sed to get the parent directories only, then sort -u to remove duplicates.
This won't work on file names/paths that contain a newline.
You may use this find command, which finds all the *.txt files and then prints their unique parent directory names:
find . -type f -name '*.txt' -exec bash -c '
for f; do
f="${f#.}"
printf "%s\0" "$PWD${f%/*}"
done
' _ {} + | awk -v RS='\0' '!seen[$0]++'
We are using printf "%s\0" to address directory names with newlines, spaces and glob characters.
Using gnu-awk to get only unique directory names printed
Using Associative array and Process Substitution.
#!/usr/bin/env bash
declare -A uniq_path
while IFS= read -rd '' files; do
path_name=${files%/*}
if ((!uniq_path["$path_name"]++)); then
printf '%s\n' "$path_name"
fi
done < <(find . -type f -name '*.txt' -print0)
Check the value of uniq_path
declare -p uniq_path
Maybe this POSIX one?
find root -type f -name '*.txt' -exec dirname {} \; | awk '!seen[$0]++'
* adds a trailing \n after each directory path
* breaks when a directory in a path has a \n in its name
Or this BSD/GNU one?
find root -type f -name '*.txt' -exec dirname {} \; -exec printf '\0' \; | sort -z -u
* adds a trailing \n\0 after each directory path
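If GNU find is available anyway, its %h directive prints a file's parent directory directly, avoiding the dirname calls and the newline problem in one go (a sketch, not from the answers above):
find root -type f -name '*.txt' -printf '%h\0' |
sort -z -u |
tr '\0' '\n'   # for display only; keep the NULs when piping further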

Terminal find, directories last instead of first

I have a makefile that concatenates JavaScript files together and then runs the file through uglify-js to create a .min.js version.
I'm currently using this command to find and concat my files
find src/js -type f -name "*.js" -exec cat {} >> ${jsbuild}$@ \;
But it lists files in subdirectories first. That makes heaps of sense, but I'd like it to list the .js files at the top of src/js before those in the subdirectories, to avoid getting my undefined JS error.
Is there any way to do this? I've had a google around and seen the sort command and the -s flag for find, but it's a bit above my understanding at this point!
[EDIT]
The final solution is slightly different to the accepted answer but it is marked as accepted as it brought me to the answer. Here is the command I used
cat `find src/js -type f -name "*.js" -print0 | xargs -0 stat -f "%z %N" | sort -n | sed -e "s|[0-9]*\ \ ||"` > public/js/myCleverScript.js
Possible solution:
use find to get each filename with its directory depth, i.e. find ... -printf "%d\t%p\n"
sort the list by directory depth with sort -n
remove the directory depth from the output to keep the filenames only
test:
without sorting:
$ find folder1/ -depth -type f -printf "%d\t%p\n"
2 folder1/f2/f3
1 folder1/file0
with sorting:
$ find folder1/ -type f -printf "%d\t%p\n" | sort -n | sed -e "s|[0-9]*\t||"
folder1/file0
folder1/f2/f3
the command you need looks like
cat $(find src/js -type f -name "*.js" -printf "%d\t%p\n" | sort -n | sed -e "s|[0-9]*\t||")>min.js
Mmmmm...
find src/js -type f
shouldn't find ANY directories at all, and doubly so as your directory names will probably not end in ".js". (The quotes around the -name pattern are not superfluous, by the way: they keep the shell from expanding *.js before find sees it.)
find src/js -type f -name "*.js" -exec cat {} >> ${jsbuild}$@ \;
find could get the first directory level already expanded on the command line, which enforces the order of directory tree traversal. This solves the problem just for the top directory (unlike the already accepted solution by Sergey Fedorov), but it should answer your question too, and more options are always welcome.
Using GNU coreutils ls, you can sort directories before regular files with the --group-directories-first option. From reading the Mac OS X ls manpage it seems that directories are always grouped there, so you would just drop the option.
ls -A --group-directories-first -r | tac | xargs -I'%' find '%' -type f -name '*.js' -exec cat '{}' + > ${jsbuild}$@
If you do not have the tac command, you could easily implement it using sed. It reverses the order of lines. See info sed tac of GNU sed.
tac(){
sed -n '1!G;$p;h'
}
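A quick sanity check of the replacement:
$ printf '1\n2\n3\n' | tac
3
2
1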
You could do something like this...
First create a variable holding the name of our output file:
OUT="$(pwd)/theLot.js"
Then, get all "*.js" in top directory into that file:
cat *.js > $OUT
Then have "find" grab all other "*.js" files below current directory:
find . -type d ! -name . -exec sh -c "cd {} ; cat *.js >> $OUT" \;
Just to explain the "find" command, it says:
find
. = starting at current directory
-type d = all directories, not files
! -name . = except the current one
-exec sh -c = and for each one you find execute the following
"..." = go to that directory and concatenate all "*.js" files there onto end of $OUT
\; = and that's all for today, thank you!
I'd get the list of all the files:
$ find src/js -type f -name "*.js" > list.txt
Sort them by depth, i.e. by the number of '/' in them, using the following ruby script:
sort.rb:
files=[]; while gets; files<<$_; end
files.sort! {|a,b| a.count('/') <=> b.count('/')}
files.each {|f| puts f}
Like so:
$ ruby sort.rb < list.txt > sorted.txt
Concatenate them:
$ cat sorted.txt | while IFS= read -r FILE; do cat "$FILE" >> output.txt; done
(All this assumes that your file names don't contain newline characters.)
EDIT:
I was aiming for clarity. If you want conciseness, you can absolutely condense it to something like:
find src/js -name '*.js'| ruby -ne 'BEGIN{f=[];}; f<<$_; END{f.sort!{|a,b| a.count("/") <=> b.count("/")}; f.each{|e| puts e}}' | xargs cat >> concatenated

How to loop through files in a directory having a certain word in filename using bash script?

I'm trying to loop through all the files in a directory which contain a certain word in the filename, using a bash script.
The following script loops through all the files in the directory,
cd path/to/directory
for file in *
do
echo $file
done
ls | grep 'my_word' gives only the files which have the word 'my_word' in the filename. However I'm unsure how to replace the * with ls | grep 'my_word' in the script.
If i do like this,
for file in ls | grep 'my_word'
do
echo $file
done
It gives me an error "syntax error near unexpected token `|'". What is the correct way of doing this?
You should avoid parsing ls where possible. Assuming there are no subdirectories in your present directory, a glob is usually sufficient:
for file in *foo*; do echo "$file"; done
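One caveat with globs: if nothing matches, Bash passes the pattern through as a literal string. Enabling nullglob avoids that (a small sketch):
shopt -s nullglob   # a non-matching glob now expands to nothing
for file in *foo*; do
    echo "$file"
done
shopt -u nullglob   # restore the default if desired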
If you have one or more subdirectories, you may need to use find. For example, to cat the files:
find . -type f -name "*foo*" | xargs cat
Or if your filenames contain special characters, try:
find . -type f -name "*foo*" -print0 | xargs -0 cat
Alternatively, you can use process substitution and a while loop:
while IFS= read -r myfile; do echo "$myfile"; done < <(find . -type f -name '*foo*')
Or if your filenames contain special characters, try:
while IFS= read -r -d '' myfile; do
echo "$myfile"
done < <(find . -type f -name '*foo*' -print0)

How to go to each directory and execute a command?

How do I write a bash script that goes through each directory inside a parent_directory and executes a command in each one?
The directory structure is as follows:
parent_directory/          (name could be anything - doesn't follow a pattern)
├── 001                    (directory names follow this pattern)
│   ├── 0001.txt           (filenames follow this pattern)
│   ├── 0002.txt
│   └── 0003.txt
├── 002
│   ├── 0001.txt
│   ├── 0002.txt
│   ├── 0003.txt
│   └── 0004.txt
└── 003
    └── 0001.txt
The number of directories is unknown.
This answer posted by Todd helped me.
find . -maxdepth 1 -type d \( ! -name . \) -exec bash -c "cd '{}' && pwd" \;
The \( ! -name . \) avoids executing the command in current directory.
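With GNU or BSD find you can get the same effect from -mindepth 1, and passing the directory as a positional argument keeps {} out of the quoted script (a sketch along the same lines):
find . -mindepth 1 -maxdepth 1 -type d -exec bash -c 'cd "$1" && pwd' _ {} \;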
You can do the following, when your current directory is parent_directory:
for d in [0-9][0-9][0-9]
do
( cd "$d" && your-command-here )
done
The ( and ) create a subshell, so the current directory isn't changed in the main script.
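A quick illustration of the subshell behaviour (the paths are hypothetical):
$ pwd
/home/user/parent_directory
$ ( cd 001 && pwd )
/home/user/parent_directory/001
$ pwd
/home/user/parent_directory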
You can achieve this by piping to xargs. The catch is that you need the -I flag, which replaces the given placeholder in the bash command with each line xargs reads.
ls -d */ | xargs -I {} bash -c "cd '{}' && pwd"
You may want to replace pwd with whatever command you want to execute in each directory.
If you're using GNU find, you can try the -execdir parameter, e.g.:
find . -type d -execdir realpath "{}" ';'
or (as per @gniourf_gniourf's comment):
find . -type d -execdir sh -c 'printf "%s/%s\n" "$PWD" "$0"' {} \;
Note: You can use ${0#./} instead of $0 to strip the leading ./.
or more practical example:
find . -name .git -type d -execdir git pull -v ';'
If you want to include the current directory, it's even simpler by using -exec:
find . -type d -exec sh -c 'cd -P -- "{}" && pwd -P' \;
or using xargs:
find . -type d -print0 | xargs -0 -L1 sh -c 'cd "$0" && pwd && echo Do stuff'
Or a similar example suggested by @gniourf_gniourf:
find . -type d -print0 | while IFS= read -r -d '' file; do
# ...
done
The above examples support directories with spaces in their name.
Or by assigning into bash array:
dirs=($(find . -type d))
for dir in "${dirs[#]}"; do
cd "$dir"
echo $PWD
done
Change . to your specific folder name. If you don't need to run recursively, you can use: dirs=(*) instead. The above example doesn't support directories with spaces in the name.
So as @gniourf_gniourf suggested, the only proper way to put the output of find in an array without using an explicit loop will be available in Bash 4.4 with:
mapfile -t -d '' dirs < <(find . -type d -print0)
Or not a recommended way (which involves parsing of ls):
ls -d */ | awk '{print $NF}' | xargs -n1 sh -c 'cd $0 && pwd && echo Do stuff'
The above example would ignore the current dir (as requested by OP), but it'll break on names with spaces.
See also:
Bash: for each directory at SO
How to enter every directory in current path and execute script? at SE Ubuntu
If the toplevel folder is known you can just write something like this:
for dir in `ls $YOUR_TOP_LEVEL_FOLDER`;
do
for subdir in `ls $YOUR_TOP_LEVEL_FOLDER/$dir`;
do
$(PLAY AS MUCH AS YOU WANT);
done
done
On the $(PLAY AS MUCH AS YOU WANT); you can put as much code as you want.
Note that I didn't "cd" on any directory.
Cheers,
for dir in PARENT/*
do
test -d "$dir" || continue
# Do something with $dir...
done
While one-liners are good for quick and dirty usage, I prefer the more verbose version below for writing scripts. This is the template I use; it takes care of many edge cases and allows you to write more complex code to execute on a folder. You can write your bash code in the function dir_command. Below, dir_command implements tagging each repository in git as an example. The rest of the script calls dir_command for each folder in the directory. An example of iterating through only a given set of folders is also included.
#!/bin/bash
#Use set -x if you want to echo each command while getting executed
#set -x
#Save current directory so we can restore it later
cur=$PWD
#Save command line arguments so functions can access them
args=("$#")
#Put your code in this function
#To access command line arguments use syntax ${args[1]} etc
function dir_command {
#This example tags and pushes the git repository in the given folder
cd "$1"
echo "$(tput setaf 2)$1$(tput sgr 0)"
git tag -a "${args[0]}" -m "${args[1]}"
git push --tags
cd ..
}
#This loop will go to each immediate child and execute dir_command
find . -maxdepth 1 -type d \( ! -name . \) | while IFS= read -r dir; do
dir_command "$dir/"
done
#This example loop only loops through a given set of folders
declare -a dirs=("dir1" "dir2" "dir3")
for dir in "${dirs[#]}"; do
dir_command "$dir/"
done
#Restore the folder
cd "$cur"
I don't get the point about the formatting of the file, since you only want to iterate through folders... Are you looking for something like this?
cd parent
find . -type d | while IFS= read -r d; do
ls "$d"/
done
you can use
find .
to search all files/dirs in the current directory recursively.
Then you can pipe the output to the xargs command like so:
find . | xargs 'command here'
#!/bin/bash
# -mindepth 1 / -maxdepth 1 means one folder deep;
# you can add a pattern instead of "*" to go only into matching folders
# note: the word splitting of $(find ...) breaks on folder names containing spaces
for folder_to_go in $(find . -mindepth 1 -maxdepth 1 -type d -name "*")
do
cd "$folder_to_go"
echo "$folder_to_go ########################################## "
# whatever you want to do goes here
cd ../ # if maxdepth/mindepth = 2, cd ../../
done
# you can try adding many internal for loops with many patterns; this will sneak anywhere you want
You could run a sequence of commands in each folder in one line, like:
for d in PARENT_FOLDER/*; do (cd "$d" && tar -cvzf "${d##*/}.tar.gz" *.*); done
for p in [0-9][0-9][0-9]; do
(
cd "$p" || exit
for f in [0-9][0-9][0-9][0-9]*.txt; do
ls "$f" # Your operands
done
)
done
