How to quickly find all git repos under a directory - bash

The following bash script is slow when scanning for .git directories because it looks at every directory. If I have a collection of large repositories it takes a long time for find to churn through every directory, looking for .git. It would go much faster if it would prune the directories within repos, once a .git directory is found. Any ideas on how to do that, or is there another way to write a bash script that accomplishes the same thing?
#!/bin/bash
# Update all git directories below current directory or specified directory
HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'
DIR=.
if [ "$1" != "" ]; then DIR="$1"; fi
cd "$DIR" > /dev/null; echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"; cd - > /dev/null
for d in `find . -name .git -type d`; do
    cd "$d/.." > /dev/null
    echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
    git pull
    cd - > /dev/null
done
Specifically, how would you use these options? For this problem, you cannot assume that the collection of repos is all in the same directory; they might be within nested directories.
top
  repo1
  dirA
    dirB
      dirC
        repo1

Check out Dennis' answer in this post about find's -prune option:
How to use '-prune' option of 'find' in sh?
find . -name .git -type d -prune
will speed things up a bit, since find won't descend into .git directories. But it still descends into the working trees of git repositories, looking for other .git folders, and that could be a costly operation.
What would be nice is some sort of look-ahead pruning mechanism in find, where if a folder has a subfolder called .git, it prunes on that folder...
That said, I'm betting your bottleneck is in the network operation 'git pull', and not in the find command, as others have posted in the comments.
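One way to approximate that look-ahead with plain find is to test each directory for a .git child and prune when the test succeeds (a sketch of the idea; the -execdir based answers further down develop it properly):
# print every directory that directly contains a .git subdirectory, then stop descending into it
find . -type d -execdir test -d {}/.git \; -print -prune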

Here is an optimized solution:
#!/bin/bash
# Update all git directories below current directory or specified directory
# Skips directories that contain a file called .ignore
HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'
function update {
    local d="$1"
    if [ -d "$d" ]; then
        if [ -e "$d/.ignore" ]; then
            echo -e "\n${HIGHLIGHT}Ignoring $d${NORMAL}"
        else
            cd "$d" > /dev/null
            if [ -d ".git" ]; then
                echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
                git pull
            else
                scan *
            fi
            cd .. > /dev/null
        fi
    fi
    #echo "Exiting update: pwd=`pwd`"
}
function scan {
    #echo "`pwd`"
    for x in "$@"; do
        update "$x"
    done
}
if [ "$1" != "" ]; then cd "$1" > /dev/null; fi
echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"
scan *

I've taken the time to copy-paste the script from your question and compare it to the script in your own answer. Here are some interesting results.
Please note that:
I've disabled the git pull commands by prefixing them with echo.
I've also removed the color codes.
I've also removed the .ignore file test in the bash solution.
I've removed the unnecessary > /dev/null redirections here and there.
I've removed the pwd calls in both.
I've added -prune, which was obviously missing from the find example.
I've used while instead of for, which was also counterproductive in the find example.
I've considerably untangled the second example to get to the point.
I've added a test in the bash solution to NOT follow symlinks, to avoid cycles and to behave like the find solution.
I've added shopt so that * also expands to dot-prefixed directory names, to match the find solution's behaviour.
Thus, we are comparing the find-based solution:
#!/bin/bash
find . -name .git -type d -prune | while read d; do
    cd "$d/.."
    echo "$PWD >" git pull
    cd "$OLDPWD"
done
With the bash shell builtin solution:
#!/bin/bash
shopt -s dotglob
update() {
    for d in "$@"; do
        test -d "$d" -a \! -L "$d" || continue
        cd "$d"
        if [ -d ".git" ]; then
            echo "$PWD >" git pull
        else
            update *
        fi
        cd ..
    done
}
update *
Note: builtins (the function and the for loop) are immune to the ARG_MAX OS limit on process arguments, so the * won't break even in very large directories.
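If you want to see the limit that note refers to, this is just an illustration (not part of the benchmark):
# the OS limit on the total size of arguments passed to exec()
getconf ARG_MAX
# an external command can fail with "Argument list too long" on a huge glob,
# whereas a for loop over the same glob never has to exec() the expanded list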
Technical differences between the solutions:
The find-based solution uses C code to crawl the repositories; it:
- has to launch a new process for the find command.
- avoids the contents of .git directories, but still crawls the working trees of git repositories, losing some time in them (and possibly finding more matching elements).
- has to chdir through several levels of subdirectories for each match, and back.
- has to chdir once in the find command and once more in the bash part.
The bash-based solution uses builtins (near-C implementations, but interpreted) to crawl the repositories; note that it:
- uses only one process.
- avoids descending into the working tree of a repository once it is found.
- only chdirs one level at a time.
- only chdirs once to both look and run the command.
Actual speed results between solutions:
I have a working development collection of git repositories on which I ran the scripts:
find solution: ~0.080s (bash chdir takes ~0.010s)
bash solution: ~0.017s
I have to admit I wasn't prepared to see such a win from the bash builtins. It became more understandable after analysing what's going on. To add insult to injury, if you change the shell from /bin/bash to /bin/sh (you must comment out the shopt line, and be prepared that it won't handle dot-prefixed directories), you'll drop to ~0.008s. Beat that!
Note that you can be more clever with the find solution by using:
find . -type d \( -exec /usr/bin/test -d "{}/.git" -a "{}" != "." \; -print -prune \
-o -name .git -prune \)
which effectively avoids crawling all the subdirectories of a found git repository, at the price of spawning a test process for each directory crawled. The final find solution I came up with ran at around ~0.030s: more than twice as fast as the previous find version, but still about twice as slow as the bash solution.
Note that /usr/bin/test is important to avoid a $PATH lookup, which costs time, and that I needed -o -name .git -prune and -a "{}" != "." because my main directory was itself inside a git repository.
In conclusion, I won't be using the bash builtin solution because it has too many corner cases for me (and my first test hit one of its limitations). But it was important to explain why it can be (much) faster in some cases; the find solution seems much more robust and consistent to me.

The answers above all rely on finding a ".git" directory. However, not all git repos have one (e.g. bare repos). The following command loops through all directories and asks git whether it considers each one to be a repository. If so, it prunes that directory's subdirectories from the search and continues.
find . -type d -exec sh -c 'cd "{}"; git rev-parse --git-dir 2> /dev/null 1>&2' \; -prune -print
It's a lot slower than other solutions because it's executing a command in each directory, but it doesn't rely on a particular repository structure. Could be useful for finding bare git repositories for example.
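If you specifically want only the bare repositories, a variation on the same idea (just a sketch along the same lines, and equally slow) is to ask git whether each directory is a bare repo:
# print directories that git reports as bare repositories, and don't descend into them
find . -type d -exec sh -c '
    cd "$1" 2>/dev/null || exit 1
    test "$(git rev-parse --is-bare-repository 2>/dev/null)" = true
' _ {} \; -prune -print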

I list all git repositories anywhere in the current directory using:
find . -type d -execdir test -d {}/.git \; -prune -print
This is fast since it stops recursing once it finds a git repository. (Although it does not handle bare repositories.) Of course, you can change the . to whatever directory you want. If you need, you can change the -print to -print0 for null-separated values.
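For example, the null-separated output can be consumed safely like this (a small sketch; the loop body is just illustrative):
find . -type d -execdir test -d {}/.git \; -prune -print0 |
while IFS= read -r -d '' repo; do
    printf 'git repository: %s\n' "$repo"
done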
To also ignore directories containing a .ignore file:
find . -type d \( -execdir test -e {}/.ignore \; -prune \) -o \( -execdir test -d {}/.git \; -prune -print \)
I've added this alias to my ~/.gitconfig file:
[alias]
repos = !"find -type d -execdir test -d {}/.git \\; -prune -print"
Then I just need to execute:
git repos
To get a complete listing of all the git repositories anywhere in my current directory.
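And if you want to act on that listing rather than just read it, something along these lines works (a sketch; it assumes none of the paths contain newlines):
git repos | while IFS= read -r d; do
    git -C "$d" pull
done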

For windows, you can put the following into a batch file called gitlist.bat and put it on your PATH.
@echo off
if {%1}=={} goto :usage
for /r %1 /d %%I in (.) do echo %%I | find ".git\."
goto :eof
:usage
echo usage: gitlist ^<path^>

Check out the answer using the locate command:
Is there any way to list up git repositories in terminal?
The advantages of using locate instead of a custom script are:
The search is indexed, so it scales
It does not require the use (and maintenance) of a custom bash script
The disadvantages of using locate are:
The db that locate uses is updated weekly, so freshly-created git repositories won't show up
Going the locate route, here's how to list all git repositories under a directory, for OS X:
Enable locate indexing (will be different on Linux):
sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist
Run this command after indexing completes (might need some tweaking for Linux):
repoBasePath=$HOME
locate '.git' | egrep '.git$' | egrep "^$repoBasePath" | xargs -I {} dirname "{}"
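If the weekly schedule is too stale for you, you can trigger a reindex by hand before running the command above (paths differ per platform; these are the usual ones):
# macOS
sudo /usr/libexec/locate.updatedb
# most Linux distributions (mlocate/plocate)
sudo updatedb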

This answer combines the partial answer provided by @Greg Barrett with my optimized answer above.
#!/bin/bash
# Update all git directories below current directory or specified directory
# Skips directories that contain a file called .ignore
HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'
export PATH=${PATH/':./:'/:}
export PATH=${PATH/':./bin:'/:}
#echo "$PATH"
DIRS="$( find "$@" -type d \( -execdir test -e {}/.ignore \; -prune \) -o \( -execdir test -d {}/.git \; -prune -print \) )"
echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"
for d in $DIRS; do
    cd "$d" > /dev/null
    echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
    git pull 2> >(sed -e 's/X11 forwarding request failed on channel 0//')
    cd - > /dev/null
done
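One caveat: the unquoted $DIRS in the for loop splits on whitespace, so repository paths containing spaces will break it. A null-delimited variant of the same logic (a sketch; the stderr filter is dropped for brevity) avoids that:
find "$@" -type d \( -execdir test -e {}/.ignore \; -prune \) -o \( -execdir test -d {}/.git \; -prune -print0 \) |
while IFS= read -r -d '' d; do
    echo -e "\n${HIGHLIGHT}Updating $d${NORMAL}"
    git -C "$d" pull
done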

Related

Correct usage of find and while-read loop in different formats?

After reading multiple answers on Stack Overflow I came up with the following solution for reading directory paths from find's output:
find "$searchdir" -type d -execdir test -d {}/.git \; -prune -print0 | while read -r -d $'\0' dir; do
# do stuff
done
However, most sources recommend something like the following approach:
while IFS= read -r -d '' file; do
    some command "$file"
done < <(find . -type f -name '*.mp3' -print0)
Why are they using process substitution? Does this change anything about the whole process, or is it just another way to do the same thing?
Is the read argument -d '' different from -d $'\0', or is that again the same thing? Does the empty string always contain at least \0, so the bash-specific $'' syntax is completely unnecessary?
I also tried doing it directly with find -exec/-execdir by passing it multiple times, and failed. Maybe the filtering and testing can be done in one command?
Non-working example:
find "$repositories_root_dir" -type d -execdir test -d {}/.git \; -prune -execdir sh -c "if git ls-remote --exit-code . \"origin/${target_branch_name}\" &> /dev/null; then echo \"Found branch '${target_branch_name}' in {}\"; git checkout \"${target_branch_name}\"; fi" \;
Sources:
https://github.com/koalaman/shellcheck/wiki/Sc2044
https://mywiki.wooledge.org/BashPitfalls#for_f_in_.24.28ls_.2A.mp3.29
In your non-working example, if you test the existence of a .git sub-directory to process only git clones and discard the other directories, then you should probably not prune because it does the exact opposite: skip only git clones.
Moreover, when using -execdir sh -c SCRIPT, you should pass positional parameters to your script instead of trying to embed the current directory name in the script with {}, which is not portable. And you could do the same for the branch name. Note that the directory name is not needed for what you try to accomplish in each git clone, because your script is executed from there.
Try this, maybe:
find "$repositories_root_dir" -type d -name '.git' -execdir sh -c '
if git ls-remote --exit-code . "origin/$1" &> /dev/null; then
printf "Found branch %s in " "$1"; pwd
echo git checkout "$1"
fi' _ "$target_branch_name" \;
(_ is assigned to positional parameter $0). Remove the echo if the result looks correct.
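As for the process-substitution part of the question: piping find into while runs the loop in a subshell, so any variables set inside it are lost when the loop ends, whereas done < <(find ...) keeps the loop in the current shell. And -d '' and -d $'\0' are effectively the same in bash, since the NUL in $'\0' is dropped and read treats an empty delimiter as "read up to a NUL byte". A minimal illustration of the subshell difference:
count=0
find . -type f -name '*.mp3' -print0 | while IFS= read -r -d '' f; do
    count=$((count + 1))    # runs in a subshell created by the pipe
done
echo "$count"               # still 0: the increments are lost
count=0
while IFS= read -r -d '' f; do
    count=$((count + 1))    # runs in the current shell
done < <(find . -type f -name '*.mp3' -print0)
echo "$count"               # the real number of matches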

Simple way of listing all directories up to git directories

I have my projects in ${HOME}/projects/ in a hierarchical structure, something like this:
${HOME}/projects/
- customer1/
  - c1_git_repo1
    - subdir1
    - ...
  - c1_git_repo2
    - ...
- customer2/
  - c2_project1/
    - c2_p1_git_repo1
      - ...
    - c2_p1_git_repo2
      - ...
...
I want to fill my CDPATH with all directories below ${HOME}/projects up to the git repos. So in the example above this would be:
CDPATH=${HOME}/projects/customer1:${HOME}/projects/customer1/c1_git_repo1:${HOME}/projects/customer1/c1_git_repo2:${HOME}/projects/customer2:${HOME}/projects/customer2/c2_project1:${HOME}/projects/customer2/c2_project1/c2_p1_git_repo1:${HOME}/projects/customer2/c2_project1/c2_p1_git_repo2
So I need to find all directories below ${HOME}/projects/, but stop at any directory which contains a ".git" folder.
Is there some cmd tool that can list the directories for me?
The following fetches all directories from $HOME/projects, except .git and contents of the .git directories.
paths=$(
    find "$HOME/projects/" -mindepth 2 -type d -not -ipath '*/.git*' \
    | while read d; do
        builtin printf %q: "$d"
    done
)
echo $paths
Sample output
/home/user/projects/dir\ with\ spaces:/home/user/projects/prj2
In the while loop we escape the directory paths and append a colon by means of the built-in printf function.
The result is stored in paths variable.
I would use something like this to find the directories:
find "$HOME/projects" -type d -execdir test -d {}/.git \; -print -prune
This stops the find whenever there's a .git, so it won't descend into the repos looking for submodules, nor will it actually descend into any of the .git directories themselves.
Next you want to get the result of that command in an array (I named the array repos). If you're using bash version 4+, the easiest way to do that is mapfile, which won't split on any spaces inside the pathnames:
mapfile -t repos < <(
find "$HOME/projects" -type d -execdir test -d {}/.git \; -print -prune)
If you're on a Mac using /bin/bash, you don't have that option, so you can use read with an altered IFS and the -d option to achieve the same result:
IFS=$'\n' read -r -d '' -a repos < <(
find "$HOME/projects" -type d -execdir test -d {}/.git \; -print -prune)
Either way, once you have the pathnames in the repos array, you can then assign CDPATH like so:
CDPATH=${repos[0]}$(printf ":%s" "${repos[@]:1}")
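If the printf expression is hard to read, an equivalent way to join the array (just a matter of taste) is to set IFS to a colon inside a command substitution:
CDPATH=$(IFS=:; printf '%s' "${repos[*]}")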

bash, "make clean" in all subdirectories

How can I find every Makefile in the current path and its subdirectories and run make clean for every occurrence?
What I have so far (it does not work) is something like:
find . -type f -name 'Makefile' 2>/dev/null | sed 's#/Makefile##' | xargs -I% cd % && make clean && cd -
Another option would be to use find with -execdir, but that gives me the $PATH issue: The current directory is included in the PATH environment variable, which is insecure in combination with the -execdir action of find ....
But I do not want to change the $PATH variable.
An answer using the tools I used would be helpful so that I can understand what I'm doing wrong,
but any working answer is acceptable.
Of course find is an option. My approach with it would be more like:
find . -name Makefile -exec bash -c 'make -C "${1%/*}" clean' -- {} \;
But since you're using bash anyway, if you're in bash 4, you might also use globstar.
shopt -s globstar
for f in **/Makefile; do make -C "${f%/*}" clean; done
If you want to use the execution feature of find you can still do this:
find "${PWD}" -name Makefile -exec sh -c 'cd "${0%Makefile}" && make clean' {} \;
I would use the following approach:
find "$(pwd)" -name Makefile | while read -r line; do cd "$(dirname "$line")" && make clean; done
Please note the find "$(pwd)", which makes find output full paths.

Renaming Subdirectories and Files

I have a script using a for loop that renames folders and files. The script takes the list of files and folders and renames them conditionally. I invoke the script using the command:
find test/* -exec ./replace.sh {} \;
My replace.sh script would contain something similar to:
for i in "$@"; do
    mv "$OLDFILE" "$NEWFILE"
done
$OLDFILE and $NEWFILE have been set previously, and I don't believe any problems will arise from them.
My problem arises when I hit upon subdirectories. Originally, I would have folders like:
folder_1
  -file1
  -file2
When my script changes folder_1 into folderX1, the next argument, folder_1/file1, would be invalid because the changed path would be folderX1/file1. I figured I could create a stack with a list of the folders being changed and pop them off later to rename the files, but this seems hard in bash. Is there a better method that I am missing?
P.S I could run the program several times to go through all the subdirectories but this doesn't seem efficient.
You can add -depth to the find command. This will process the directory's files before the directory itself. See man find for details.
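Applied to the invocation from the question, that would look roughly like this (a sketch; children are then visited before their parent, so renaming a directory no longer invalidates the paths of the files inside it):
find test -depth -exec ./replace.sh {} \;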
Your find usage is problematic. The first option is the start location for the search, so you don't want to use a glob there. If you want only the files in test/ and not any of its subdirectories, use the -depth option, as Olaf suggested.
You don't really need to use a separate script to handle this rename. It can be done within the find command line, if you don't mind a little mess.
To handle just the top-level of files, you could do this:
$ touch foo.txt bar.txt baz.ext
$ find test -depth 1 -type f -name \*.txt -exec bash -c 'f="{}"; mv -v "{}" "${f/.txt/.csv}"' \;
./foo.txt -> ./foo.csv
./bar.txt -> ./bar.csv
$
But your concern is valid -- find will build a list of matches, and if your -exec changes the list out from under find, some renames will fail.
I suspect your quickest solution is to do this in TWO stages (not several): one for files, followed by one for directories. (Or change the order, I don't think it should matter.)
$ mkdir foo_1; touch red_2 foo_1/blue_3
$ find . -type f -name \*_\* -exec bash -c 'f="{}"; mv -v "{}" "${f%_?}X${f##*_}"' \;
./foo_1/blue_3 -> ./foo_1/blueX3
./red_2 -> ./redX2
$ find . -type d -name \*_\* -exec bash -c 'f="{}"; mv -v "{}" "${f%_?}X${f##*_}"' \;
./foo_1 -> ./fooX1
Bash parameter expansion will get you a long way.
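To see what the two expansions in the -exec body do on a sample name (just an illustration):
f="./foo_1/blue_3"
echo "${f%_?}"            # ./foo_1/blue   (drop the shortest suffix matching "_?")
echo "${f##*_}"           # 3              (drop the longest prefix matching "*_")
echo "${f%_?}X${f##*_}"   # ./foo_1/blueX3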
Another option, depending on your implementation of find, is the -d option:
-d Cause find to perform a depth-first traversal, i.e., directories
are visited in post-order and all entries in a directory will be
acted on before the directory itself. By default, find visits
directories in pre-order, i.e., before their contents. Note, the
default is not a breadth-first traversal.
So:
$ mkdir -p foo_1/bar_2; touch red_3 foo_1/blue_4 foo_1/bar_2/green_5
$ find . -d -name \*_\* -exec bash -c 'f="{}"; mv -v "{}" "${f%_?}X${f##*_}"' \;
./foo_1/bar_2/green_5 -> ./foo_1/bar_2/greenX5
./foo_1/bar_2 -> ./foo_1/barX2
./foo_1/blue_4 -> ./foo_1/blueX4
./foo_1 -> ./fooX1
./red_3 -> ./redX3
$

Script to recursively delete CVS directory on server

So far I've come up with this:
find . -name 'CVS' -type d -exec rm -rf {} \;
It's worked locally thus far; can anyone see any potential issues? I want this to recursively delete 'CVS' directories accidentally uploaded onto a server.
Also, how can I make it a script in which I can specify a directory to clean up?
Well, the obvious caveat: It'll delete directories named CVS, regardless of if they're CVS directories or not.
You can turn it into a script fairly easily:
#!/bin/sh
if [ -z "$1" ]; then
    echo "Usage: $0 path"
    exit 1
fi
find "$1" -name 'CVS' -type d -print0 | xargs -0 rm -Rf
# or find … -exec like you have, if you can't use -print0/xargs -0
# print0/xargs will be slightly faster.
# or find … -exec rm -Rf '{}' + if you have reasonably modern find
edit
If you want to make it safer/more fool-proof, you could do something like this after the first if/fi block (there are several ways to write this):
⋮
case "$1" in
/srv/www* | /home)
true
;;
*)
echo "Sorry, can only clean from /srv/www and /home"
exit 1
;;
esac
⋮
You can make it as fancy as you want (for example, instead of aborting, it could prompt if you really meant to do that). Or you could make it resolve relative paths, so you wouldn't have to always specify a full path (but then again, maybe you want that, to be safer).
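For instance, resolving the argument before the case check could look like this (a sketch; it assumes GNU realpath is available):
# turn a possibly-relative argument into an absolute, symlink-free path
target=$(realpath -- "$1") || exit 1
case "$target" in
    /srv/www* | /home/*)
        true
        ;;
    *)
        echo "Sorry, can only clean from /srv/www and /home"
        exit 1
        ;;
esac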
A simple way to do it would be:
find . -iname CVS -type d | xargs rm -rf
