I have a bash script that checks whether the files to be committed fit within a size limit. However, when there are a large number of files, the script can take a long time to complete, even if no files exceed the limit.
Here is the original script:
result=0
for file in $( git diff-index --ignore-submodules=all --cached --diff-filter=ACMRTUXB --name-only HEAD )
do
    echo "$file"
    if [[ -f "$file" ]]
    then
        file_size=$( git cat-file -s :"$file" )
        if [ "$file_size" -gt "$max_allowed_packed_size" ]
        then
            echo "File $file is $(( file_size / 2**20 )) MB after compressing, which is larger than our configured limit of $(( max_allowed_packed_size / 2**20 )) MB."
            result=1
        fi
    fi
done
exit $result
Do you have any ideas for improving the performance of checking the staged files?
1. Use Git LFS (Large File Storage): Git LFS is an open-source Git extension that replaces large files with text pointers. This allows Git to handle large files more efficiently, which can speed up file size checking.
2. Ignore large files: You can also speed up file size checking by ignoring large files that are not necessary for the repository. You can do this by creating a .gitignore file in the root directory of your repository and adding patterns for the files or file types you want to ignore.
3. Use shallow cloning: Shallow cloning means that you only clone a limited amount of commit history from the remote repository. This can significantly reduce the amount of data you need to download and check, and can speed up file size checking.
4. Use Git hooks: Git hooks are scripts that run automatically when certain Git events occur, such as a commit or push. You can use a Git hook to check the file size of new or modified files and reject them if they exceed a certain size limit. This can help prevent large files from being added to the repository in the first place, which can save time on file size checking.
5. Use a faster computer or network: If your computer or network is slow, file size checking will naturally be slower. Upgrading your computer or network can help speed up file size checking.
You can also install git-sizer, which reports the objects, trees, and blobs that are inflating a repository and helps you decide what to ignore or move to Git LFS:
$ git-sizer --help
$ git status   # check what is currently staged
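If the main cost is the one git cat-file process forked per staged file, another option is to batch all the size lookups through a single git cat-file --batch-check call. Below is a minimal sketch of that idea (not from the original script): it reuses the max_allowed_packed_size variable the original assumes (the 5 MB default is only an example), and paths containing whitespace would need extra handling because --batch-check splits its input at the first space.
#!/bin/bash
# Sketch: check all staged file sizes with one git cat-file --batch-check call.
max_allowed_packed_size=${max_allowed_packed_size:-$(( 5 * 2**20 ))}   # example limit: 5 MB
result=0

while read -r file_size file; do
    case "$file_size" in
        ''|*[!0-9]*) continue ;;   # skip "missing" or otherwise unparsable lines
    esac
    if [ "$file_size" -gt "$max_allowed_packed_size" ]; then
        echo "File $file is $(( file_size / 2**20 )) MB, larger than the configured limit of $(( max_allowed_packed_size / 2**20 )) MB."
        result=1
    fi
done < <(
    git diff-index --ignore-submodules=all --cached --diff-filter=ACMRTUXB --name-only -z HEAD |
    while IFS= read -r -d '' f; do
        # feed ":<path> <path>" so %(rest) echoes the path back next to its size
        printf ':%s %s\n' "$f" "$f"
    done |
    git cat-file --batch-check='%(objectsize) %(rest)'
)

exit $result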
We want to prevent:
Very large text files (> 50MB per file) from being committed to git instead of git-lfs, as they inflate git history.
The problem is, 99% of them are < 1 MB and should be committed normally for better diffing.
The reason for the variance in size: these are YAML files, which support binary serialization via base64 encoding.
The reason we can't reliably prevent binary serialization: this is a Unity project, and binary serialization is needed for various reasons.
Given:
GitHub hosting's lack of pre-receive hook support.
git-lfs's lack of file size attribute support.
Questions:
How can we reliably prevent large files from being added to commit?
Can this be done through a config file in repo so all users follow this rule gracefully?
If not, can this be done by bash command aliasing so trusted users can see a warning message when they accidentally git add a large file and it's not processed by git-lfs?
(Our environment is macOS. I have looked at many solutions and so far none satisfy our needs)
Alright, with help from CodeWizard and this SO answer, I managed to put together a good guide myself:
First, set up your repo's core.hooksPath with:
git config core.hooksPath .githooks
Second, create this pre-commit file inside the .githooks folder so it can be tracked (gist link), then remember to give it execute permission with chmod +x.
#!/bin/bash
#
# Pre-commit hook that rejects staged files over a size limit unless they are
# tracked by git-lfs (as declared in .gitattributes).
# Called by "git commit" with no arguments. The hook exits with a non-zero
# status after issuing an appropriate message if it wants to stop the commit.
#
# Redirect output to stderr.
exec 1>&2

FILE_SIZE_LIMIT_KB=1024
CURRENT_DIR="$(pwd)"
COLOR='\033[01;33m'
NOCOLOR='\033[0m'
HAS_ERROR=""
COUNTER=0

# Generate a file extension filter from .gitattributes for git-lfs tracked files.
filter=$(grep filter=lfs .gitattributes | awk '{printf "-e .%s$ ", $1}')

# Before the commit, check the size of staged files that are not tracked by git-lfs.
# $filter is intentionally left unquoted so each "-e pattern" becomes its own argument.
files=$(git diff --cached --name-only | sort | uniq | grep -v $filter)

while read -r file; do
    if [ "$file" = "" ]; then
        continue
    fi
    file_path="$CURRENT_DIR/$file"
    file_size=$(ls -l "$file_path" | awk '{print $5}')
    file_size_kb=$((file_size / 1024))
    if [ "$file_size_kb" -ge "$FILE_SIZE_LIMIT_KB" ]; then
        printf "${COLOR}%s${NOCOLOR} has size %sKB, over commit limit %sKB.\n" "$file" "$file_size_kb" "$FILE_SIZE_LIMIT_KB"
        HAS_ERROR="YES"
        ((COUNTER++))
    fi
done <<< "$files"

# Exit with an error if any non-lfs tracked file is over the file size limit.
if [ "$HAS_ERROR" != "" ]; then
    echo "$COUNTER files are larger than permitted, please fix them before committing" >&2
    exit 1
fi
exit 0
Now, assuming you have both .gitattributes and git-lfs set up properly, this pre-commit hook will run when you try to git commit and make sure that all staged files not tracked by git-lfs (as specified in your .gitattributes) satisfy the specified file size limit.
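For reference, git-lfs-tracked patterns in .gitattributes look like the lines below (the extensions here are only hypothetical examples; your own patterns will differ). It is the filter=lfs token on each line that the hook's grep keys on:
*.psd filter=lfs diff=lfs merge=lfs -text
*.fbx filter=lfs diff=lfs merge=lfs -text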
Any new users of your repo will need to set up core.hooksPath themselves, but beyond that, things should just work.
Hope this helps other Unity developers fighting with growing git repo size!
How can we reliably prevent large files from being added to commit?
Can this be done through a config file in the repo so all users follow this rule gracefully?
Since GitHub doesn't support server-side hooks, you can use client-side hooks. As you are probably aware, those hooks can be bypassed and disabled with no problem, but still, this is a good way to do it.
core.hooksPath
Git v2.9 added the ability to keep the client-side hooks in a custom folder. Prior to that, the hooks had to be placed inside the .git folder.
This allows you to write scripts and put them anywhere. I assume you know what hooks are, but if not, feel free to ask.
How to do it?
Usually, you place the hooks inside your repo (or any other common folder).
# set the hooks path. for git config, the default location is --local
# so this configuration is locally per project
git config core.hooksPath .githooks
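For new clones, a small bootstrap script tracked in the repo can run that configuration once so users don't have to remember it. The file name and message below are only a suggestion, not part of the original answer:
#!/bin/bash
# Hypothetical one-time setup script (e.g. setup-hooks.sh); run once after cloning.
set -e
git config core.hooksPath .githooks   # local to this repository only
chmod +x .githooks/pre-commit         # make sure the tracked hook is executable
echo "Git hooks installed from .githooks/"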
I have a number of scripts that I use almost everyday in my work. I develop and maintain these on my personal laptop. I have a local git repository where I track the changes, and I have a repository on github to which I push my changes.
I do a lot of my work on a remote supercomputer, and I use my scripts there a lot. I would like to keep my remote /home/bin updated with my maintained scripts, but without cluttering the system with my repository.
My current solution does not feel ideal. I have added the code below to my .bashrc. Whenever I log in, my repository is deleted, and I then clone my project from GitHub. Then I copy the script files I want to my bin and make them executable.
This sort of works, but it does not feel like an elegant solution. I would like to simply download the script files directly, without bothering with the git repository. I never edit my script files from the remote computer anyway, so I just want to get the files from github.
I was thinking that perhaps wget could work, but it did not feel very robust to include the urls to the raw file page at github; if I rename the file I suppose I have to update the code as well. At least my current solution is robust (as long as the github link does not change).
Code in my .bashrc:
REPDIR=mydir
if [ -d $REPDIR ]; then
rm -rf $REPDIR
echo "Old repository removed."
fi
cd $HOME
git clone https://github.com/user/myproject
cp $REPDIR/*.py $REPDIR/*.sh /home/user/bin/
chmod +x /home/user/bin/*
Based on Kent's solution, I have defined a function that updates my scripts. To avoid any issues with symlinks, I just unlink everything and relink. That might just be my paranoia, though...
function updatescripts() {
    DIR=/home/user/scripts
    CURR_DIR=$PWD
    cd "$DIR"
    git pull origin master
    cd "$CURR_DIR"
    for file in "$DIR"/*.py "$DIR"/*.sh; do
        if [ -L "$HOME/bin/$(basename "$file")" ]; then
            unlink "$HOME/bin/$(basename "$file")"
        fi
        ln -s "$file" "$HOME/bin/$(basename "$file")"
    done
}
On that remote machine, don't do rm and then clone; keep the repository somewhere and just do git pull. Since you said you will not change the files on that machine, there won't be conflicts.
For the script files, don't do cp; instead, create symbolic links (ln -s) in your target directory.
I am attempting to run 'git rm -rf --cached .' along with 'git add .' to remove cached files that are now listed in the .gitignore. I use Visual Studio on a windows computer, and prefer to leave line endings just as they are for this particular situation.
I tried setting core.autocrlf to false using git config command. I tried creating a .gitattributes with the line '* -text', rm'ing the .git/index, and running git reset. So far, every time I add the files back, I get a huge list of modified files.
EDIT: The change in the files is not actually line endings, it is changes in file permissions which I did not request.
Edit: the remaining problem is that the file modes are apparently not stored properly in Windows systems (see also What is git's "filemode"?). To save and restore them, one will need a script, plus the original data:
git ls-files --stage > /tmp/original
To recover the modes, this rather crude pipeline should work:
< /tmp/original \
awk -F$'\t' '/^100755 / { print "git update-index --chmod=+x \"" $2 "\"" }' |
sh
This will attempt to chmod +x files that have been removed by the below sequence, so you can expect some error messages if there are any such files. (It also assumes no files have double quotes in their names.)
Assuming you do not already have a .gitattributes file, here is a six step process that should work:
1. Create that .gitattributes file just as you did
2. Run rm .git/index
3. Run git checkout HEAD -- .
4. Run git rm -r --cached .
5. Run git add .
6. Run git rm .gitattributes (you can leave this until after verifying that it all worked). Run git commit afterward.
I do not have (nor use) Windows so cannot test this, but here's the theory behind why it should work, and hence why there are these steps.
Git's actual data storage format is a special, Git-only, compressed (sometimes highly compressed) format. Files stored in this format are mainly useful only to Git itself. This format stores a raw, uninterpreted byte stream: files do not have to be separated into "text" and "data" and so on, they are just raw byte streams (hence treated as "data" / "non-text"). The data, once stored, are read-only and get assigned a hash ID (currently SHA-1 though a future Git may use SHA-256). Git calls a file stored this way a blob, which is a term stolen from the database world.
Your computer's useful-file-storage format is of course different, and may (and does on Windows) make a distinction between "text" and "data". Text may have encodings (such as ISO-8859-1, UTF-8, UTF-16, and so on). These files are generally both readable and writable and anything on your computer can deal with them (to some degree anyway, depending on encoding).
Git has to extract files from commits, turning them from blobs into files that you can work with. These files live in your work-tree. You work with them, and then git add them to give Git a chance to re-blob-ize them.
In between these special Git-only blobs and the work-tree, Git needs a place to store the blobbed data, that—unlike a commit—is writable, but that—like a commit—has the file in the special Git-only format. This "in between" place is Git's index. Various bits of Git documentation sometimes call this the staging area or the cache.
Git uses the index copy of each file (or blob, really) to make new commits. When you run git add, Git reads the work-tree file, encodes it down into the blob form, and saves it—well, its hash ID, really—in the index. When you run git commit, Git simply freezes the index copies into committed copies.
When you run git checkout to switch to some commit, Git extracts the commit into the index (filling in all the blob hash IDs), and also extracts the blobs into the work-tree so that they are in useful format and you can work on them. When you run git add, Git compresses the work-tree file into its blob format and replaces the index entry for the file.
Transforming a blob into a work-tree file, or vice versa, is the ideal place where Git will do any conversions you need, such as turning newlines into CRLF line endings. So that's where Git does it: git checkout fills the index and expands-and-converts into the work-tree, and git add compresses-and-un-converts from the work-tree into the index, ready for the next git commit. (Any files you don't touch, stay compressed and ready to go, safely tucked away in the index.)
You already know that a tracked file is one that is in the index, and an untracked file is one that is in the work-tree but not in the index. Your goal is to use the existing .gitignore to make files that are currently in the index go away from the index if they would be .gitignore-ed. The process you are using is:
git rm -r --cached .: remove everything from the index, so that the entire work-tree is untracked
git add .: produce all new blobs in the index from whatever is in the work-tree, while ignoring any file that is listed in .gitignore.
The issue here is that what's in the work-tree has been converted by the "blob to work-tree" conversions, and will be "un-converted" by the "work-tree to blob" conversions. Creating a .gitattributes file with * -text tells Git: "The conversions to do are no conversions at all."
Unfortunately, it's too late: the git checkout you ran earlier, to get this commit into the work-tree, already did some conversions.
So here, we use step 1 to create a .gitattributes file that says do no conversions. Step 2, rm .git/index, removes the index entirely. Git now has no idea what's actually in the work-tree. This step may be unnecessary but I use it to force Git to act in step 3, which tells Git: extract every file from the HEAD commit into the index and the work-tree. This re-creates the index, and re-fills the work-tree, this time doing no conversions.
Steps 4 and 5 are just as before, but this time, the work-tree files all match the blobs in the HEAD commit since step 3 operated with the .gitattributes directive in place. Step 6 is to make sure you do not commit the "do no conversions" directive.
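As a quick sanity check (not part of the original answer), git ls-files --eol shows, for each file, the line endings stored in the index (i/...), the line endings in the work-tree (w/...), and the text attribute in effect, so you can confirm that the * -text rule really disabled conversion before you commit:
git ls-files --eol | head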
I have a small Bash script which includes some Git commands. (For certain reasons, I cannot use git hooks here.)
Basically, it does
git pull origin <<some repo>> || { echo "Git pull FAILED"; exit 1; }
# do something with the new/changed files on the file system
In cases that are not reproducible, this fails: old versions of the files (their state before git pull) are used instead of the new files (their state after git pull).
However, if I manually do git pull and afterward run the other command, there is never any problem.
So, I'm now wondering if there is any delay/asynchronicity in Git changing the files on the file system after a pull. If yes: How can I deal with it (maybe avoiding sleep or something like that)? If not: What else could cause the confusion of file versions here?
I have been working on how to verify that millions of files that were on file system A have in fact been moved to file system B. While working on a system migration, it became evident that all the files needed to be audited to prove that they have been moved. The files were initially moved via rsync, which does provide logs, although not in a format that is helpful for doing an audit. So, I wrote this script to index all the files on system A:
#!/bin/bash
# Get directories and file list to be used to verify proper file moves have worked successfully.
LOGDATE=$(/usr/bin/date +%Y-%m-%d)
FILE_LIST_OUT=/mounts/A_files_$LOGDATE.txt
MOUNT_POINTS="/mounts/AA /mounts/AB"
touch "$FILE_LIST_OUT"
echo "TYPE,USER,GROUP,BYTES,OCTAL,FILE_NAME" > "$FILE_LIST_OUT"
for directory in $MOUNT_POINTS; do
    # format: type,user,group,bytes,octal,file_name
    gfind "$directory" -mount -printf "%y","%u","%g","%s","%m","%p\n" >> "$FILE_LIST_OUT"
done
The file indexing works fine and takes about two hours to index ~30 million files.
On side B is where we run into issues. I have written a very simple shell script that reads the index file, tests whether each file is there, and counts how many files are found, but it runs out of memory while looping through the 30 million lines of indexed file names. Effectively, it runs the little bit of code below inside a while loop, with counters that increment for files found and not found.
if [ -f "$TYPE" "$FILENAME" ] ; then
print file found
++
else
file not found
++
fi
My questions are:
Can a shell script do this type of reporting from such a large list? A 64-bit Unix system ran out of memory while trying to execute this script. I have already considered breaking the input up into smaller chunks to make it faster. Currently it can
If a shell script is inappropriate, what would you suggest?
You just used rsync, use it again...
--ignore-existing
This tells rsync to skip updating files that already exist on the destination (this does not ignore existing directories, or nothing would get done). See also --existing.
This option is a transfer rule, not an exclude, so it doesn’t affect the data that goes into the file-lists, and thus it doesn’t affect deletions. It just limits the files that the receiver requests to be transferred.
This option can be useful for those doing backups using the --link-dest option when they need to continue a backup run that got interrupted. Since a --link-dest run is copied into a new directory hierarchy (when it is used properly), using --ignore-existing will ensure that the already-handled files don't get tweaked (which avoids a change in permissions on the hard-linked files). This does mean that this option is only looking at the existing files in the destination hierarchy itself.
That will actually fix any problems (at least in the same sense that any diff list of file-exist tests could fix the problem). Using --ignore-existing means rsync only does the file-exist tests (so it'll construct the diff list as you request and use it internally). If you just want information on the differences, check --dry-run and --itemize-changes.
Let's say you have two directories, foo and bar. bar has three files, 1, 2, and 3, and bar also has a directory quz, which has a file 1. The directory foo is empty:
Now, here is the result,
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/
>f+++++++++ 1
>f+++++++++ 2
>f+++++++++ 3
cd+++++++++ quz/
>f+++++++++ quz/1
Note, you're not interested in the cd+++++++++ -- that's just showing you that rsync issued a chdir. Now, let's add a file in foo called 1, and let's use grep to remove the chdir(s),
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/ | grep -v '^cd'
>f+++++++++ 2
>f+++++++++ 3
>f+++++++++ quz/1
f is for file. The +++++++++ means the file doesn't exist in the DEST dir.
Here is the bonus: remove --dry-run and it'll go ahead and make the changes for you.
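For the directories in this example, that would be:
$ rsync -ri --ignore-existing ./bar/ ./foo/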
Have you considered a solution such as kdiff3, which will diff directories of files?
Note this feature from version 0.9.84:
Directory-Comparison: Option "Full Analysis" allows to show the number
of solved vs. unsolved conflicts or deltas vs. whitespace-changes in
the directory tree.
There is absolutely no problem reading a 30 million line file in a shell script. The reason why your process failed was most likely that you tried to read the file entirely into memory, e.g. by doing something wrong like for i in $(cat file).
The correct way of reading a file is:
while IFS= read -r line
do
echo "Something with $line"
done < someFile
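Applied to this case, a minimal sketch (not from the original answers) that streams the index file generated on side A and counts found vs. missing files could look like this. It assumes the CSV layout TYPE,USER,GROUP,BYTES,OCTAL,FILE_NAME produced by the gfind command above, that file names contain no commas, and that the same mount paths are visible on side B:
#!/bin/bash
# Usage: ./audit.sh /mounts/A_files_<date>.txt
INDEX_FILE=$1
found=0
missing=0

while IFS= read -r line; do
    type=${line%%,*}        # first field: file type (f, d, l, ...)
    file_name=${line##*,}   # last field: full path
    [ "$type" = "f" ] || continue   # only audit regular files; this also skips the header line
    if [ -f "$file_name" ]; then
        found=$((found + 1))
    else
        missing=$((missing + 1))
        printf 'missing: %s\n' "$file_name"
    fi
done < "$INDEX_FILE"

printf 'found: %d, missing: %d\n' "$found" "$missing"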
A shell script is inappropriate, yes. You should be using a diff tool:
diff -rNq /original /new
If you're not particular about the solution being a script, you could also look into meld, which would let you diff directory trees quite easily and you can also set ignore patterns if you have any.