Caching APT packages in GitHub Actions workflow - apt

I use the following GitHub Actions workflow for my C project. The workflow finishes in ~40 seconds, but more than half of that time is spent installing the valgrind package and its dependencies.
I believe caching could help me speed up the workflow. I do not mind waiting a couple of extra seconds, but this just seems like a pointless waste of GitHub's resources.
name: C Workflow
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: make
        run: make
      - name: valgrind
        run: |
          sudo apt-get install -y valgrind
          valgrind -v --leak-check=full --show-leak-kinds=all ./bin
Running sudo apt-get install -y valgrind installs the following packages:
gdb
gdbserver
libbabeltrace1
libc6-dbg
libipt1
valgrind
I know Actions supports caching of a specific directory (and there are already several answered SO questions and articles about this), but I am not sure where all the different packages installed by apt end up. I assume /bin/ and /usr/bin/ are not the only directories affected by installing packages.
Is there an elegant way to cache the installed system packages for future workflow runs?

The purpose of this answer is to show how caching can be done with GitHub Actions, not necessarily how to cache valgrind specifically (although it does that too). I also try to explain why not everything can or should be cached: the cost (in terms of time) of caching and restoring a cache needs to be weighed against simply reinstalling the dependency.
You will make use of the actions/cache action to do this.
Add it as a step (before you need to use valgrind):
- name: Cache valgrind
  uses: actions/cache@v2
  id: cache-valgrind
  with:
    path: "~/valgrind"
    key: ${{secrets.VALGRIND_VERSION}}
The next step should restore the cached version, if there is one, or otherwise install valgrind from the repositories:
- name: Install valgrind
  env:
    CACHE_HIT: ${{steps.cache-valgrind.outputs.cache-hit}}
    VALGRIND_VERSION: ${{secrets.VALGRIND_VERSION}}
  run: |
    if [[ "$CACHE_HIT" == 'true' ]]; then
      sudo cp --verbose --force --recursive ~/valgrind/* /
    else
      sudo apt-get install --yes valgrind="$VALGRIND_VERSION"
      mkdir -p ~/valgrind
      sudo dpkg -L valgrind | while IFS= read -r f; do if test -f "$f"; then echo "$f"; fi; done | xargs cp --parents --target-directory ~/valgrind/
    fi
Explanation
Set the VALGRIND_VERSION secret to the output of:
apt-cache policy valgrind | grep -oP '(?<=Candidate:\s)(.+)'
This will allow you to invalidate the cache when a new version is released, simply by changing the value of the secret.
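If you prefer not to maintain the secret by hand, the candidate version can also be computed inside the workflow and used as part of the cache key. This is not part of the original answer, only a minimal sketch; the step id and the key format below are illustrative:
- name: Look up valgrind candidate version
  id: valgrind-version
  run: |
    # Hypothetical helper step: expose the candidate version as a step output
    VERSION=$(apt-cache policy valgrind | grep -oP '(?<=Candidate:\s)(.+)')
    echo "version=$VERSION" >> "$GITHUB_OUTPUT"
- name: Cache valgrind
  uses: actions/cache@v2
  id: cache-valgrind
  with:
    path: "~/valgrind"
    key: valgrind-${{ steps.valgrind-version.outputs.version }}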
dpkg -L valgrind is used to list all the files installed when running sudo apt-get install valgrind.
What we can now do with this command is copy all of those files to our cache folder:
dpkg -L valgrind | while IFS= read -r f; do if test -f "$f"; then echo "$f"; fi; done | xargs cp --parents --target-directory ~/valgrind/
Furthermore
In addition to copying all of valgrind's own files, it may also be necessary to copy its dependencies (such as libc in this case), but I don't recommend continuing along this path because the dependency chain only grows from there. To be precise, the dependencies that would need to be copied to end up with an environment valgrind can run in are as follows:
libc6
libgcc1
gcc-8-base
To copy all these dependencies, you can use the same syntax as above:
for dep in libc6 libgcc1 gcc-8-base; do
  dpkg -L "$dep" | while IFS= read -r f; do if test -f "$f"; then echo "$f"; fi; done | xargs cp --parents --target-directory ~/valgrind/
done
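If you did want to go further down that path, the dependency list does not have to be hardcoded. This is not from the original answer, just a hedged sketch that derives it with apt-cache (the exact flags and output format can vary between apt versions):
# Illustrative only: print valgrind plus its recursive (hard) dependencies
apt-cache depends --recurse --no-recommends --no-suggests \
  --no-conflicts --no-breaks --no-replaces --no-enhances valgrind \
  | grep '^[[:alnum:]]' | sort -u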
Is all this work really worth the trouble when all that is required to install valgrind in the first place is to simply run sudo apt-get install valgrind? If your goal is to speed up the build process, then you also have to take into account the time it takes to restore (download and extract) the cache versus simply running the command again to install valgrind.
And finally, to restore the cache (assuming it is stored at /tmp/valgrind), you can use the command:
sudo cp --force --recursive /tmp/valgrind/* /
which basically copies all the files from the cache onto the root partition.
In addition to the process above, I also have an example of "caching valgrind" by compiling and installing it from source. The cache is now about 63 MB (compressed), and one still needs to separately install libc, which kind of defeats the purpose.
Note: Another answer to this question proposes what I consider to be a safer approach to caching dependencies: using a container which comes with the dependencies pre-installed. The best part is that you can use Actions to keep those containers up to date.
References:
https://askubuntu.com/a/408785
https://unix.stackexchange.com/questions/83593/copy-specific-file-type-keeping-the-folder-structure

You could create a docker image with valgrind preinstalled and run your workflow on that.
Create a Dockerfile with something like:
FROM ubuntu
RUN apt-get update && apt-get install -y valgrind
Build it and push it to dockerhub:
docker build -t natiiix/valgrind .
docker push natiiix/valgrind
Then use something like the following as your workflow:
name: C Workflow
on: [push, pull_request]
jobs:
  build:
    container: natiiix/valgrind
    steps:
      - uses: actions/checkout@v1
      - name: make
        run: make
      - name: valgrind
        run: valgrind -v --leak-check=full --show-leak-kinds=all ./bin
Completely untested, but you get the idea.
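To keep that image up to date, the build and push can themselves be automated with a separate workflow. This is not part of the original answer, only a rough sketch using the Docker login and build-push actions; it assumes Docker Hub credentials are stored in secrets named DOCKERHUB_USERNAME and DOCKERHUB_TOKEN, and that the Dockerfile above sits at the repository root:
name: Rebuild valgrind image
on:
  schedule:
    - cron: '0 3 * * 0'  # weekly rebuild to pick up package updates
  workflow_dispatch:
jobs:
  image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v2
        with:
          context: .
          push: true
          tags: natiiix/valgrind:latest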

Updated:
I created a GitHub action which works like this solution, with less code and better optimizations: Cache Anything New.
This solution is similar to the most-voted one. I tried the proposed solution, but it didn't work for me because I was installing texlive-latex and pandoc, which have many dependencies and sub-dependencies.
I created a solution which should help many people. It covers two cases: when you install a couple of packages (apt install), and when you build a program from source and it takes a while.
Solution:
The step that has all the logic and creates the cache:
Use find to create a list of all the files in the container.
Install all the packages and build the programs, whatever you want to cache.
Use find again to create a second list of all the files in the container.
Use diff to get the newly created files.
Add these new files to the cache directory. This directory is automatically stored by actions/cache@v2.
The step that loads the created cache:
Copy all the files from the cache directory to the main path /.
The steps that benefit from the cache, plus any other steps that you need. (A simplified shell sketch of this idea follows the list.)
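Stripped of the workflow syntax, the core idea looks roughly like this. It is a simplified sketch, not the exact commands used below; the paths, the excluded directories and the package names are placeholders:
# Snapshot the filesystem, install, snapshot again, then copy the difference.
sudo find / -type f,l -not -path "/proc/*" -not -path "/sys/*" \
  > /tmp/snapshot_before.txt 2> /dev/null || true
sudo apt-get install -y <your-packages>
sudo find / -type f,l -not -path "/proc/*" -not -path "/sys/*" \
  > /tmp/snapshot_after.txt 2> /dev/null || true
# Lines only present in the second snapshot are the newly installed files.
diff /tmp/snapshot_before.txt /tmp/snapshot_after.txt \
  | grep '^> ' | sed 's/^> //' > /tmp/new_files.txt
while IFS= read -r f; do
  sudo cp -a --parents "$f" "$CACHE_DIR"  # $CACHE_DIR is the path handed to actions/cache
done < /tmp/new_files.txt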
When to use this?
Without the cache, installing the packages took around 2 minutes of the whole process.
With the cache, it takes 7-10 minutes to create it the first time.
Using the cache, the whole process finishes in about 1 minute.
It is only useful if your main process takes a lot of time, and it is especially convenient if you are deploying very often.
Implementation:
Source code: .github/workflows
Landing page of my actions: workflows.
release.yml
name: CI - Release books
on:
  release:
    types: [ released ]
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/checkout@v2
      - uses: actions/cache@v2
        id: cache-packages
        with:
          path: ${{ runner.temp }}/cache-linux
          key: ${{ runner.os }}-cache-packages-v2.1
      - name: Install packages
        if: steps.cache-packages.outputs.cache-hit != 'true'
        env:
          SOURCE: ${{ runner.temp }}/cache-linux
        run: |
          set +xv
          echo "# --------------------------------------------------------"
          echo "# Action environment variables"
          echo "github.workspace: ${{ github.workspace }}"
          echo "runner.workspace: ${{ runner.workspace }}"
          echo "runner.os: ${{ runner.os }}"
          echo "runner.temp: ${{ runner.temp }}"
          echo "# --------------------------------------------------------"
          echo "# Where am I?"
          pwd
          echo "SOURCE: ${SOURCE}"
          ls -lha /
          sudo du -h -d 1 / 2> /dev/null || true
          echo "# --------------------------------------------------------"
          echo "# APT update"
          sudo apt update
          echo "# --------------------------------------------------------"
          echo "# Set up snapshot"
          mkdir -p "${{ runner.temp }}"/snapshots/
          echo "# --------------------------------------------------------"
          echo "# Install tools"
          sudo rm -f /var/lib/apt/lists/lock
          #sudo apt install -y vim bash-completion
          echo "# --------------------------------------------------------"
          echo "# Take first snapshot"
          sudo find / \
            -type f,l \
            -not \( -path "/sys*" -prune \) \
            -not \( -path "/proc*" -prune \) \
            -not \( -path "/mnt*" -prune \) \
            -not \( -path "/dev*" -prune \) \
            -not \( -path "/run*" -prune \) \
            -not \( -path "/etc/mtab*" -prune \) \
            -not \( -path "/var/cache/apt/archives*" -prune \) \
            -not \( -path "/tmp*" -prune \) \
            -not \( -path "/var/tmp*" -prune \) \
            -not \( -path "/var/backups*" \) \
            -not \( -path "/boot*" -prune \) \
            -not \( -path "/vmlinuz*" -prune \) \
            > "${{ runner.temp }}"/snapshots/snapshot_01.txt 2> /dev/null \
            || true
          echo "# --------------------------------------------------------"
          echo "# Install pandoc and dependencies"
          sudo apt install -y texlive-latex-extra wget
          wget -q https://github.com/jgm/pandoc/releases/download/2.11.2/pandoc-2.11.2-1-amd64.deb
          sudo dpkg -i pandoc-2.11.2-1-amd64.deb
          rm -f pandoc-2.11.2-1-amd64.deb
          echo "# --------------------------------------------------------"
          echo "# Take second snapshot"
          sudo find / \
            -type f,l \
            -not \( -path "/sys*" -prune \) \
            -not \( -path "/proc*" -prune \) \
            -not \( -path "/mnt*" -prune \) \
            -not \( -path "/dev*" -prune \) \
            -not \( -path "/run*" -prune \) \
            -not \( -path "/etc/mtab*" -prune \) \
            -not \( -path "/var/cache/apt/archives*" -prune \) \
            -not \( -path "/tmp*" -prune \) \
            -not \( -path "/var/tmp*" -prune \) \
            -not \( -path "/var/backups*" \) \
            -not \( -path "/boot*" -prune \) \
            -not \( -path "/vmlinuz*" -prune \) \
            > "${{ runner.temp }}"/snapshots/snapshot_02.txt 2> /dev/null \
            || true
          echo "# --------------------------------------------------------"
          echo "# Filter new files"
          diff -C 1 \
            --color=always \
            "${{ runner.temp }}"/snapshots/snapshot_01.txt \
            "${{ runner.temp }}"/snapshots/snapshot_02.txt \
            | grep -E "^\+" \
            | sed -E s/..// \
            > "${{ runner.temp }}"/snapshots/snapshot_new_files.txt
          < "${{ runner.temp }}"/snapshots/snapshot_new_files.txt wc -l
          ls -lha "${{ runner.temp }}"/snapshots/
          echo "# --------------------------------------------------------"
          echo "# Make cache directory"
          rm -fR "${SOURCE}"
          mkdir -p "${SOURCE}"
          while IFS= read -r LINE
          do
            sudo cp -a --parent "${LINE}" "${SOURCE}"
          done < "${{ runner.temp }}"/snapshots/snapshot_new_files.txt
          ls -lha "${SOURCE}"
          echo ""
          sudo du -sh "${SOURCE}" || true
          echo "# --------------------------------------------------------"
      - name: Copy cached packages
        if: steps.cache-packages.outputs.cache-hit == 'true'
        env:
          SOURCE: ${{ runner.temp }}/cache-linux
        run: |
          echo "# --------------------------------------------------------"
          echo "# Using Cached packages"
          ls -lha "${SOURCE}"
          sudo cp --force --recursive "${SOURCE}"/. /
          echo "# --------------------------------------------------------"
      - name: Generate release files and commit in GitHub
        run: |
          echo "# --------------------------------------------------------"
          echo "# Generating release files"
          git fetch --all
          git pull --rebase origin main
          git checkout main
          cd ./src/programming-from-the-ground-up
          ./make.sh
          cd ../../
          ls -lha release/
          git config --global user.name 'Israel Roldan'
          git config --global user.email 'israel.alberto.rv@gmail.com'
          git add .
          git status
          git commit -m "Automated Release."
          git push
          git status
          echo "# --------------------------------------------------------"
Explaining some pieces of the code:
Here the cache action takes a key, which is generated once and compared in later executions, and a path, the directory whose files will be archived and stored as the cache.
- uses: actions/cache@v2
  id: cache-packages
  with:
    path: ${{ runner.temp }}/cache-linux
    key: ${{ runner.os }}-cache-packages-v2.1
These conditionals check whether the cache key exists; if it does, cache-hit is 'true'.
if: steps.cache-packages.outputs.cache-hit != 'true'
if: steps.cache-packages.outputs.cache-hit == 'true'
It's not critical, but the first time the du command executes, Linux indexes all the files (5-8 minutes); after that, the find commands take only ~50 seconds to list all the files. You can delete this line if you want.
The trailing || true prevents the command from returning a non-zero exit status (the 2> /dev/null alone does not); otherwise the action would stop because it detects that your script failed. You will see a couple of these throughout the script.
sudo du -h -d 1 / 2> /dev/null || true
This is the magical part: use find to generate a list of the current files, excluding some directories to keep the cache folder small. The same command is executed again after installing the packages and building the programs; for that second snapshot the output file name should be different (snapshot_02.txt).
sudo find / \
-type f,l \
-not \( -path "/sys*" -prune \) \
-not \( -path "/proc*" -prune \) \
-not \( -path "/mnt*" -prune \) \
-not \( -path "/dev*" -prune \) \
-not \( -path "/run*" -prune \) \
-not \( -path "/etc/mtab*" -prune \) \
-not \( -path "/var/cache/apt/archives*" -prune \) \
-not \( -path "/tmp*" -prune \) \
-not \( -path "/var/tmp*" -prune \) \
-not \( -path "/var/backups*" \) \
-not \( -path "/boot*" -prune \) \
-not \( -path "/vmlinuz*" -prune \) \
> "${{ runner.temp }}"/snapshots/snapshot_01.txt 2> /dev/null \
|| true
Install some packages and pandoc.
sudo apt install -y texlive-latex-extra wget
wget -q https://github.com/jgm/pandoc/releases/download/2.11.2/pandoc-2.11.2-1-amd64.deb
sudo dpkg -i pandoc-2.11.2-1-amd64.deb
rm -f pandoc-2.11.2-1-amd64.deb
Generate the text file with the newly added files; these can be symbolic links, too.
diff -C 1 \
"${{ runner.temp }}"/snapshots/snapshot_01.txt \
"${{ runner.temp }}"/snapshots/snapshot_02.txt \
| grep -E "^\+" \
| sed -E s/..// \
> "${{ runner.temp }}"/snapshots/snapshot_new_files.txt
At the end, copy all the files into the cache directory in archive mode (cp -a) to preserve the original file metadata.
while IFS= read -r LINE
do
  sudo cp -a --parent "${LINE}" "${SOURCE}"
done < "${{ runner.temp }}"/snapshots/snapshot_new_files.txt
Step to copy all the cached files into the main path /.
- name: Copy cached packages
  if: steps.cache-packages.outputs.cache-hit == 'true'
  env:
    SOURCE: ${{ runner.temp }}/cache-linux
  run: |
    echo "# --------------------------------------------------------"
    echo "# Using Cached packages"
    ls -lha "${SOURCE}"
    sudo cp --force --recursive "${SOURCE}"/. /
    echo "# --------------------------------------------------------"
This step is where I use the packages restored from the cache; the ./make.sh script uses pandoc to do some conversions. As I mentioned, you can create other steps that benefit from the cache, and others that do not use it at all.
- name: Generate release files and commit in GitHub
  run: |
    echo "# --------------------------------------------------------"
    echo "# Generating release files"
    cd ./src/programming-from-the-ground-up
    ./make.sh

For reference, several implementations of this idea already exist:
https://github.com/awalsh128/cache-apt-pkgs-action (a minimal usage sketch follows this list)
installs and uses apt-fast from https://git.io/vokNn instead of calling apt-get directly (https://askubuntu.com/questions/52243/what-is-apt-fast-and-should-i-use-it)
generates a unique cache directory name from the input package list
uses dpkg -L to list the installed files
tars the package files into ${cache_dir}/${installed_package}.tar (without compression).
Compression is not required as long as actions/cache does the compression:
https://github.com/awalsh128/cache-apt-pkgs-action/issues/46 https://github.com/awalsh128/cache-apt-pkgs-action/pull/53
https://github.com/airvzxf/cache-anything-new-action
Caching APT packages in GitHub Actions workflow
scans the Linux container to check whether anything new was added after your custom script runs, then caches all the new files
the script must be in a standalone file inside the GitHub workflows directory
does not generate a unique cache directory name
can exclude user directories from the scan
can be much slower than just using dpkg -L, but finds all changes in the file system
https://github.com/Mudlet/xmlstarlet-action
an example of a Docker-based action that runs xmlstarlet with arguments
limited to a static, already-committed Dockerfile and entrypoint.sh; it cannot use an external script or instruction set
must be used from the GitHub Actions pipeline only; it cannot be called from an inner bash (or other) script, because the install and the run cannot be separated
~50% slower than a single apt-get install, but can be faster for multiple packages
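For illustration, a minimal usage sketch of awalsh128/cache-apt-pkgs-action (not taken from the original answers; the packages and version inputs are based on that action's README, so check it for the current syntax):
- uses: awalsh128/cache-apt-pkgs-action@latest
  with:
    packages: valgrind
    version: 1.0  # change this value to invalidate the cache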

Related

xargs with multiple commands only working on some files

I'm trying (starting with my MacBook) to get a list of all image files matching the specification in the line below, along with their size and sha512. I'm doing this to audit the tens of thousands of such files I have spread over multiple systems.
sudo find /Users \( -iname '*.JPG' -or -iname '*.NEF' -or -iname '*.PNG' \
  -or -iname '*.RAF' -or -iname '*.PW2' -or -iname '*.DNG' \) -type f -and \
  -size +10000k -print0 | xargs -0 -I ## \
  /bin/bash -c '{ stat -n -f"MACBOOK %z " "##" && shasum -p -a 512 "##"; }'
When run, this correctly produces the output I want for some of the files, for example I get;
MACBOOK 32465640 <SHA512-REDACTED> ?/Users/<REDACTED>/Pictures/Pendleton Roundup/2018/2018-09-13/_DSC3955.NEF
But for some of the files, the ## replacement doesn't seem to work properly and instead I get;
MACBOOK 28130793 shasum: ##:
If I add a -v flag to the bash line to print out the commands I expect to be executed when it goes wrong I see this;
{ stat -n -f"MACBOOK %z " "/Users/<REDACTED>/Pictures/Photos Library D750.photoslibrary/Masters/2018/07/29/20180729-223141/DSC_3274.NEF"; shasum -p -a 512 "##"; }
If I manually run that line with the ## replaced with the filename, it works as expected, so it seems that the -I ## parameter to xargs is somehow not always working, and I'm at a loss as to what the cause might be.
Can anyone help me evolve a fix for this? I've tried putting the ## in quotes, tried with different patterns and always the same issue.
Consider:
find_args=( -false )
for type in jpg nef png raf pw2 dng; do
find_args+=( -o -name "*.$type" )
done
sudo find /Users '(' "${find_args[@]}" ')' \
-type f \
-size +10000k \
-exec sh -c '
for arg; do
stat -n -f"MACBOOK %z " "$arg"
shasum -p -a 512 "$arg"
done' _ {} +
Using -exec ... {} + lets find invoke only one copy of sh per batch of files (as many as will fit on a command line on your local platform).
Even more importantly, not using {} inside the sh -c argument avoids command injection vulnerabilities, which with the original code would allow malicious filenames to run arbitrary commands (especially important when you're running under sudo, so those commands would be executed as root!).
The problem isn't that you were using xargs. It's that your find was run with sudo but any process receiving your piped or redirected output was not run with sudo, so your permissions during the find do not match your permissions during the subsequent xargs execution.
So, for example, instead of running:
sudo ls -al >> list.txt
you should instead run the entire pipeline of commands with sudo, as follows:
sudo sh -c 'ls -al >> list.txt'
UPDATE: NOT RECOMMENDED - see comment below.
I seem to have a variant that works now without xargs.
sudo find /Users \( -iname '*.JPG' -or -iname '*.NEF' -or -iname '*.PNG' -or -iname '*.RAF' \
  -or -iname '*.PW2' -or -iname '*.DNG' \) -type f -and -size +10000k \
  -exec sh -c '{ stat -n -f"MACBOOK %z " "{}"; shasum -p -a 512 "{}"; }' {} \;

Unix Find Command Ignoring Missing File or Directory

I have this little command that deletes all files within the ~/Library/Caches, ~/Library/Logs, /Library/Caches and /Library/Logs directories, but sometimes one or more of the directories are missing and the rm -rf command is not executed.
sudo find ~/Library/Caches ~/Library/Logs /Library/Caches /Library/Logs -mindepth 1 -type f -exec rm -rf {} +
I want the command to ignore missing directories and just run the command on the files that are found.
The only issue here is that you are annoyed by the error messages about missing directories.
You can redirect the standard error stream to /dev/null to ignore them:
sudo find ~/Library/Caches ~/Library/Logs \
/Library/Caches /Library/Logs \
-mindepth 1 -type f -exec rm -rf {} + 2>/dev/null
Also note that -mindepth 1 is not needed here, and that some find implementations have -delete:
sudo find ~/Library/Caches ~/Library/Logs \
/Library/Caches /Library/Logs \
-type f -delete 2>/dev/null
Or, with a shell that understands brace expansions:
sudo find {~,}/Library/{Logs,Caches} -type f -delete 2>/dev/null
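Another option, not part of the original answer, is to filter out the missing directories before calling find; a minimal bash sketch:
dirs=()
for d in ~/Library/Caches ~/Library/Logs /Library/Caches /Library/Logs; do
  [ -d "$d" ] && dirs+=("$d")  # keep only the directories that actually exist
done
if [ "${#dirs[@]}" -gt 0 ]; then
  sudo find "${dirs[@]}" -type f -delete
fi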

Searching Directories and Removing Folders and Files in Bash

I have a bash script that goes into components/ and runs the following command:
cp -R vendor/* .
I then have a second command that traverses any folder, except the vendor folder, inside the components directory looking for .git/, .gitignore and Documentation/ and removes them. However:
I don't think it's recursive
It doesn't actually remove those files and directories, either because of the point above or because of permissions (should I add a sudo?)
A directory copied from vendor might look like:
something/
  child-directory/
    .git/ // -- Should be removed.
The command in question is:
find -name vendor -prune -o \( -name ".git" -o -name ".gitignore" -o -name "Documentation" \) -prune -exec rm -rf "{}" \; 2> /dev/null || true
Now if it is a permission error, I won't know about it, because I want it to ignore any errors and continue with the script.
Any thoughts?
I think the problem is in the option -prune. Anyways, this might work for you...
find vendor -name '.git' -o -name '.gitignore' -o -name 'Documentation' | xargs rm -rf
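If you still want to skip the vendor directory, as in your original command, the prune-based approach can also be kept. This is an untested sketch, not part of the answer above, and it assumes you run it from inside components/:
find . -path ./vendor -prune -o \
  \( -name .git -o -name .gitignore -o -name Documentation \) -prune \
  -exec rm -rf {} + 2> /dev/null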

How to quickly find all git repos under a directory

The following bash script is slow when scanning for .git directories because it looks at every directory. If I have a collection of large repositories, it takes a long time for find to churn through every directory looking for .git. It would go much faster if it pruned the directories within repos once a .git directory is found. Any ideas on how to do that, or is there another way to write a bash script that accomplishes the same thing?
#!/bin/bash
# Update all git directories below current directory or specified directory
HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'
DIR=.
if [ "$1" != "" ]; then DIR=$1; fi
cd $DIR>/dev/null; echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"; cd ->/dev/null
for d in `find . -name .git -type d`; do
  cd $d/.. > /dev/null
  echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
  git pull
  cd - > /dev/null
done
Specifically, how would you use these options? For this problem, you cannot assume that the collection of repos is all in the same directory; they might be within nested directories. For example:
top
  repo1
  dirA
  dirB
  dirC
    repo1
Check out Dennis' answer in this post about find's -prune option:
How to use '-prune' option of 'find' in sh?
find . -name .git -type d -prune
This will speed things up a bit, as find won't descend into .git directories, but it still descends into git repositories looking for other .git folders, and that could be a costly operation.
What would be cool is if there were some sort of find lookahead pruning mechanism, where if a folder has a subfolder called .git, then it prunes that folder...
That said, I'm betting your bottleneck is in the network operation 'git pull', and not in the find command, as others have posted in the comments.
Here is an optimized solution:
#!/bin/bash
# Update all git directories below current directory or specified directory
# Skips directories that contain a file called .ignore
HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'
function update {
  local d="$1"
  if [ -d "$d" ]; then
    if [ -e "$d/.ignore" ]; then
      echo -e "\n${HIGHLIGHT}Ignoring $d${NORMAL}"
    else
      cd $d > /dev/null
      if [ -d ".git" ]; then
        echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
        git pull
      else
        scan *
      fi
      cd .. > /dev/null
    fi
  fi
  #echo "Exiting update: pwd=`pwd`"
}
function scan {
  #echo "`pwd`"
  for x in $*; do
    update "$x"
  done
}
if [ "$1" != "" ]; then cd $1 > /dev/null; fi
echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"
scan *
I've taken the time to copy-paste the script in your question and compare it to the script in your own answer. Here are some interesting results:
Please note that:
I've disabled the git pull by prefixing it with an echo
I've also removed the color settings
I've also removed the .ignore file testing in the bash solution
And removed the unnecessary > /dev/null here and there
removed pwd calls in both
added -prune, which is obviously lacking in the find example
used "while" instead of "for", which was also counterproductive in the find example
considerably untangled the second example to get to the point
added a test to the bash solution to NOT follow symlinks, to avoid cycles and behave like the find solution
added shopt to allow * to also expand to dotted directory names, to match the find solution's functionality
Thus, we are comparing the find-based solution:
#!/bin/bash
find . -name .git -type d -prune | while read d; do
  cd $d/..
  echo "$PWD >" git pull
  cd $OLDPWD
done
With the bash shell builtin solution:
#!/bin/bash
shopt -s dotglob
update() {
  for d in "$@"; do
    test -d "$d" -a \! -L "$d" || continue
    cd "$d"
    if [ -d ".git" ]; then
      echo "$PWD >" git pull
    else
      update *
    fi
    cd ..
  done
}
update *
Note: builtins (the function and the for loop) are immune to the MAX_ARGS OS limit for launching processes, so the * won't break even on very large directories.
Technical differences between the solutions:
The find-based solution uses C code to crawl the repository tree; it:
has to load a new process for the find command.
will avoid ".git" contents but will crawl the working directories of git repositories, losing some time in those (and eventually finding more matching elements).
will have to chdir through several levels of subdirectories for each match and back.
will have to chdir once in the find command and once in the bash part.
The bash-based solution uses builtins (so a near-C implementation, but interpreted) to crawl the tree; note that it:
will use only one process.
will avoid descending into git working-directory subdirectories.
will only perform chdir one level at a time.
will only perform chdir once for looking and for performing the command.
Actual speed results between the solutions:
I have a working development collection of git repositories on which I launched the scripts:
find solution: ~0.080s (bash chdir takes ~0.010s)
bash solution: ~0.017s
I have to admit that I wasn't prepared to see such a win from bash builtins. It became more apparent and normal after analysing what's going on. To add insult to injury, if you change the shell from /bin/bash to /bin/sh (you must comment out the shopt line, and be prepared for it not to parse dotted directories), you'll drop to ~0.008s. Beat that!
Note that you can be more clever with the find solution by using:
find . -type d \( -exec /usr/bin/test -d "{}/.git" -a "{}" != "." \; -print -prune \
-o -name .git -prune \)
which effectively avoids crawling all the sub-directories inside a found git repository, at the price of spawning a process for each directory crawled. The final find solution I came up with was around ~0.030s, which is more than twice as fast as the previous find version, but remains about 2 times slower than the bash solution.
Note that /usr/bin/test is important to avoid a $PATH search, which costs time, and that I needed -o -name .git -prune and -a "{}" != "." because my main repository was itself a git sub-repository.
As a conclusion, I won't be using the bash builtin solution because it has too many corner cases for me (and my first test hit one of its limitations). But it was important for me to explain why it could be (much) faster in some cases; the find solution seems much more robust and consistent to me.
The answers above all rely on finding a ".git" directory. However, not all git repos have one (e.g. bare repos). The following command will loop through all directories and ask git whether it considers each one to be a repository. If so, it prunes sub-directories off the tree and continues.
find . -type d -exec sh -c 'cd "{}"; git rev-parse --git-dir 2> /dev/null 1>&2' \; -prune -print
It's a lot slower than other solutions because it's executing a command in each directory, but it doesn't rely on a particular repository structure. Could be useful for finding bare git repositories for example.
I list all git repositories anywhere in the current directory using:
find . -type d -execdir test -d {}/.git \; -prune -print
This is fast since it stops recursing once it finds a git repository. (Although it does not handle bare repositories.) Of course, you can change the . to whatever directory you want. If you need, you can change the -print to -print0 for null-separated values.
To also ignore directories containing a .ignore file:
find . -type d \( -execdir test -e {}/.ignore \; -prune \) -o \( -execdir test -d {}/.git \; -prune -print \)
I've added this alias to my ~/.gitconfig file:
[alias]
repos = !"find -type d -execdir test -d {}/.git \\; -prune -print"
Then I just need to execute:
git repos
To get a complete listing of all the git repositories anywhere in my current directory.
For windows, you can put the following into a batch file called gitlist.bat and put it on your PATH.
@echo off
if {%1}=={} goto :usage
for /r %1 /d %%I in (.) do echo %%I | find ".git\."
goto :eof
:usage
echo usage: gitlist ^<path^>
Check out the answer using the locate command:
Is there any way to list up git repositories in terminal?
The advantages of using locate instead of a custom script are:
The search is indexed, so it scales
It does not require the use (and maintenance) of a custom bash script
The disadvantages of using locate are:
The db that locate uses is updated weekly, so freshly-created git repositories won't show up
Going the locate route, here's how to list all git repositories under a directory, for OS X:
Enable locate indexing (will be different on Linux):
sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist
Run this command after indexing completes (might need some tweaking for Linux):
repoBasePath=$HOME
locate '.git' | egrep '.git$' | egrep "^$repoBasePath" | xargs -I {} dirname "{}"
This answer combines the partial answer provided by @Greg Barrett with my optimized answer above.
#!/bin/bash
# Update all git directories below current directory or specified directory
# Skips directories that contain a file called .ignore
HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'
export PATH=${PATH/':./:'/:}
export PATH=${PATH/':./bin:'/:}
#echo "$PATH"
DIRS="$( find "$@" -type d \( -execdir test -e {}/.ignore \; -prune \) -o \( -execdir test -d {}/.git \; -prune -print \) )"
echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"
for d in $DIRS; do
  cd "$d" > /dev/null
  echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
  git pull 2> >(sed -e 's/X11 forwarding request failed on channel 0//')
  cd - > /dev/null
done

How can I properly check delete/write permissions for a folder hierarchy?

In a script I'm working on, I want to rm -rf a folder hierarchy. But before doing so, I want to make sure the process can complete successfully. So, this is the approach I came up with:
#!/bin/bash
set -o nounset
set -o errexit
BASE=$1
# Check if we can delete the target base folder
echo -n "Testing write permissions in $BASE..."
if ! find $BASE \( -exec test -w {} \; -o \( -exec echo {} \; -quit \) \) | xargs -I {} bash -c "if [ -n "{}" ]; then echo Failed\!; echo {} is not writable\!; exit 1; fi"; then
exit 1
fi
echo "Succeeded"
rm -rf $BASE
Now I'm wondering if there might be a better solution (more readable, shorter, reliable, etc).
Please note that I am fully aware of the fact that there could be changes in file access permissions between the check and the actual removal. In my use case, that is acceptable (compared to not doing any checks). If there was a way to avoid this though, I'd love to know how.
Are you aware of the -perm switch of find (if it is present in your version)?
find -perm -u+w
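For example, something along these lines could replace the find/xargs pipeline. It is only a sketch: -writable is a GNU find test (use -perm -u+w as a rough fallback), and it still leaves the race between the check and the removal that you already mentioned:
#!/bin/bash
set -o nounset
set -o errexit
BASE=$1

# Print the first entry the current user cannot write to, if any.
not_writable=$(find "$BASE" ! -writable -print -quit)
if [ -n "$not_writable" ]; then
    echo "Failed! $not_writable is not writable!" >&2
    exit 1
fi
echo "Succeeded"
rm -rf "$BASE"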
