How does git detect modified files on Windows? - windows

(This is not a duplicate of How does git detect that a file has been modified? because I'm asking about Windows, the referenced QA mentions stat and lstat, which do not apply to Windows).
With traditional systems like SVN and TFS, the "state database" needs to be explicitly and manually informed of any changes to files in your local workspace: files are read-only by default so you don't accidentally make a change without explicitly informing your SVN/TFS client first. Fortunately IDE integration means that operations that result in the addition, modification, deletion and renaming (i.e. "checking-out") of files can be automatically passed on to the client. It also means that you would need something like TortoiseSVN to work with files in Windows Explorer, lest your changes be ignored - and that you should regularly run an often lengthy Server-to-Local comparison scan to detect any changes.
But Git doesn't have this problem - on my Windows machine I can have a gigabyte-sized repo with hundreds of thousands of files, many levels deep, and yet if I make a 1 byte change to a file nested very deeply, I can see that Git knows after running git status. This is the strange part - because git does not use any daemon processes or background tasks - running git status also does not involve any significant IO activity that I can see, I get the results back immediately, it does not thrash my disk searching for the change I made.
Additionally, Git GUI tools, such as the Git integration with Visual Studio 2015 also have some degree of magic in them - I can make a change in Notepad or another program, and VS' Git Changes window picks it up immediately. VS could simply be using ReadDirectoryChanges (FileSystemWatcher) - but when I look at the devenv process in Process Explorer I don't see any corresponding handles, but that also doesn't explain how git status sees the changes.

Git runs a Windows equivalent of the POSIX-y lstat(2) call on each file recorded in the index to have the first stab at figuring out whether the file is modified or not. It compares the modification time and size taken from that information with the values recorded for that file in the index.
This operation is notoriously slow on NTFS (and network-mapped drives) so since some time Git for Windows gained a special tweak controlled with the core.fscache configuration option which became enabled by default some 2 or 3 GfW releases ago. I don't know the exact details but it tries to minimize the number of times Git needs to lstat(2) your files.
IIUC, the mechanism enabled by core.fscache is not making use of filesystem watching Win32 API as Git runs no daemons/services on your system; so it merely optimizes the way Git asks the filesystem layer about the stat info of the tracked files.

As Briana Swift and kostix point out - it is scanning your disk. However, when looking for unstaged changes, it does not need to read every file on your disk. Instead, it can look at the metadata stored in the index to determine what files to examine more closely (actually reading them).
If you use the git-ls-files command to examine the index, you can see this metadata:
% git ls-files --debug worktree.c
worktree.c
ctime: 1463782535:0
mtime: 1463782535:0
dev: 16777220 ino: 120901250
uid: 501 gid: 20
size: 5591 flags: 0
Now if you run git status, git will look at worktree.c on disk. If the timestamps and filesize match, then git will assume that you have not changed this file.
If, however, the timestamps and filesize do not match, then git will look more closely at the file to determine if you have changed it or not.
So git does "thrash" the disk, but in a much more limited manner than if you did something like tf reconcile to examine your changes. (TFVC, of course, was designed to deal with very large working trees and should never touch your disk if you're using it correctly.)
And yes - Visual Studio does have some magic in it. It runs a background filesystem watcher in both your working directory and some parts of the Git repository. When it notices a change in your working directory, it will re-compute the git status. It also looks at changes to branches in the Git repository to know when you've switched branches or to recompute the status of your local repository with your remote.

Git's process of git status is very lightweight.
git status checks the index (also known as staging area, before you run git add) and the working directory (after git add but before git commit), then compares those files with the last committed version. Instead of having to go through every file in the repository, Git first checks these areas to see what to look up in the most recent commit.
git diff works similarly. I suggest looking here for more information.

Related

Git for Windows - prevent .pack file date/time modification

I am using Git for Windows (version 2.15, but the same issue occurs in 2.14 and I think older versions as well) and I noticed a rather annoying behavior: When I perform some basic git operations*), the modification date of the .git/objects/pack/pack-*.pack file changes. The file itself remains unchanged, but the last modification date field gets updated, which causes my backup software to think the file was changed and needs to be added to my differential backup. Because my .pack files are rather large, this increases the size of my daily backups significantly. Is there a way to prevent this behavior? That is, keep the pack file completely unchanged, including its metadata, until I perform a git gc or git repack?
Unfortunately, I wasn't able to pinpoint which operation causes this behavior. When it happened today, I only used git status, git log, git add, git mv and git commit and nothing else and the date/time got changed, but when I tried to replicate the behavior on my yesterday's backup, the date change didn't occur. I guess next time I will run Process Monitor and watch accesses to the file, but in the meanwhile, does anyone have an idea of what might be causing this problem? Thanks.
Instead of referencing your Git repo itself for your backup program to process (with the date issue), you could have:
a task which does a git bundle of your repo (that generates only one file)
your backup program would back up only that one file.
That way, you bypass entirely the modification date issue for those pack files.
You can either save and keep only one copy of a full bundle of the repo.
Or make incremental bundles.
In the end it turns out that Edward Thomson's answer explains why no "real" solution is possible. However, to facilitate my needs, I wrote a simple Windows command-line application which scans through a tree of directories, locates possible Git repositories, locates their packfiles and changes the date/time of each .pack file to that of the respective .idx file. So far it seems to run OK. I did not encounter any garbage collection issues yet, anyway. I did not release the tool yet, because I rather suspect no one else cares, but if someone is interested, I can upload it somewhere.
Apparently, someone is interested. So the program is released as of now. Not on GitHub, but still as open source, under the 3-clause BSD license. Download the binaries here: https://www.pepak.net/files/git/gitpacksync-0.01.zip
and the source code here: https://www.pepak.net/files/git/gitpacksync-0.01-source.zip
If you try to disable this then you would be prone to see subtle bugs where objects that are still in use will disappear from your repository.
You had trouble pinpointing the exact operation because every operation that adds files will do it.
This is very much intentional - Git refreshes the timestamps of objects in the database (updating the timestamp on either loose objects or packfiles) to know when an object was last written. Whenever you create a new commit, it will update the timestamp on all the files that contain objects hat were referenced.
This is important as it helps the tools that remove data (like prune) avoid race conditions: an object may be dereferenced and then re-referenced. Prune will also look at the timestamp, so by touching the file, it will not be eligible for garbage collection.

Safely using junction or mklink /j with a git repository on Windows

Using Git on Windows, I'm trying to deal with content that's external to my git repo. We have artwork and content files for instance that are being updated by non git-users in google drive so to capture these changes I've setup something similar to the following;
d:\MyRepo
\.git
\code1
\images1
\fonts (junction) => c:\users\%username%\google drive\designerLtd\fonts
\etc
Where 'fonts' is a folder has been linked using either junction.exe or mklink /j (same thing). This generally works out great because Git status immediately highlights new changes (either on purpose or by accident) and prepares them for checkin or undo.
ISSUE: sometimes when switching branches Git prunes the linked directories and re-creates them if content in those folders is different between branches. In effect it breaks the link. Now Git is always correct and the build is consistent but it's not always obvious that it is no longer keeping track of those external resources.
Worse still, it can delete files in the external location. They can be recovered from git of course, but it's very unwieldy.
Swapping the content in the external locations when branches are switched isn't a problem, because there's only one PC that's hooked up this way and they're easily merged, but I just wish it didn't break the links.
QUESTION: Is there a better way to allow external junction points within a Git repo on Windows?
To be clear, there are no symlinks in the GIT repository (yet) as far as I know and this isn't a question about interoperability between Unix and Windows git clients (which most of the other questions on SO seem to relate to).
You can modify permissions of the junction point so that git can no longer delete it. Git usually doesn't care if removing a directory fails (except if it needs to replace the directory with a file).
See "Usage Recommendations" in https://support.microsoft.com/en-us/kb/205524

Git: don't update certain files on Windows

We work with a git respository that has over 20,000 files.
My group maintains local versions of about 100 or so of configuration and source files from this repository. THe original acts as a sort of base that several groups modify and tweak to their own needs (some core things are not allowed to be changed, but front end and some custom DB stuff are different between groups)
So we want to update to the latest version generally, but not have the git update overwrite the files that we keep local modifications for.
The machines we use are windows based. Currently the repository gets cloned to a windows server that then gets checked out/cloned to the development machines (which are also windows). The developers make changes as necessary and recommit to our local repo. The local repo updates against the master daily. We never commit back to the master.
So we want all the files that haven't been changed by our group to update, but any that have been changed (ever) won't get updated.
Is there a way to allow this to happen automatically, so the windows server just automatically updates daily, ignoring those files we keep modifications for. And if we want to add a new file to this "don't update" list its just a right-click (or even a flat file list away). I looked at git-ignore but it seems to be for committing, not for updating.
Even better would be a way to automatically download the vanilla files but have them renamed automatically. For example settings.conf is a file we want to keep changes on generally, but if they modify the way entries in that file are handled or add extra options it would be nice it it downloaded it as settings.conf.vanilla or something so we just run a diff on .vanilla files against ours and see what we want to keep. Though this feature is not absolutely necessary and seems unlikely.
If this cannot be accomplished on a windows machine (the software for windows doesn't support such features), please list some Linux options as well if available. We do have an option to use a Linux server for hosting the local git repo if needed.
Thanks.
It sounds like you're working with a third party code base that's under active development and you have your own customisations which you need to apply.
I think the answer you're looking for is rebase. You shouldn't need to write any external logic to achieve this, except for a job which regularly pulls in the third party changes and rebases your modifications on top of them.
This should also be more correct than simply ignoring the files you've modified, as you won't then accidentally ignore changes that the third party has made to those files (you may sometimes get a conflict, which could be frustrating, but better than silently missing an important change).
Assuming that your local repo is indeed simply a fork, maintain your changes on your own branch, and every time you update the remote repository, simply rebase your local branch on top of those changes:
git pull origin master
git checkout custom_branch
git rebase master
Edit
After you've done this, you'll end up with all the changes you made on your custom_branch sitting on top of master. You can then continue to make your customisations on your own branch, and development of the third party code can continue independently.
The next time you want to pull in the extra changes, you'll repeat the process:
Make sure you're on the master branch before pulling in changes to the third party code:
git checkout master
Pull in the changes:
git pull origin master
Change to your customised branch:
git checkout custom_branch
Move your changes on top of the third party changes:
git rebase master
This will then put all your own changes on top of master again. master itself won't be changed.
Remember that the structure of your repo just comes from a whole set of "hashes" which form a tree. Your branches are just like "post it" notes which are attached to a particular hash, and can be moved to another hash as your branch grows.
The rebase is like chopping off a branch and re-attaching it somewhere else. In this case, you're saying something like "chop off our changes and re-attach them on top of the main trunk".
If you can install a visual tool like GitX, it will really help to see how the branch tags move around when you rebase. The command line is ideal for working with but I find something like GitX is invaluable for getting a handle on the structure of your repo.

Make commits to a remote Git repo without a local one?

My question is simple:
I have a project with a local Git repo, and I have pushed it to Bitbucket. What I'm trying to do is to remove the local repo and commit my project just to the remote repo, so that I don't have a double-sized project on my hard drive.
Is there any good solution for this?
More details
I'm worrying that the .git folder might drain my hard drive. Creating my local Git repo keeps all the files, and I ended up creating a project that's twice as big. The app deals with media files...it's 500 MB without Git.
Creating local git keeps all files
Yes, that's what git does. From a cursory Google search:
Rather than a single, central repository on which clients synchronize, each peer's working copy of the codebase is a bona fide repository ... [this] results in some important differences from a centralized system
...
Each working copy effectively functions as a remote backup of the codebase and of its change-history, protecting against data loss
...
Allows users to work productively when not connected to a network.
Makes most operations much faster.
Allows private work, so users can use their changes even for early drafts they do not want to publish.
...
Avoids relying on one physical machine as a single point of failure.
As for your "problem"
I'm worrying about that .git folder might drain my hard drive.
A git repository of Firefox is 200 MB. Consider the size of your project relative to Firefox, and then be prepared to set aside a generous ten thousand, two-hundred and forty kilobytes for your project's git repository.
I'd like to point out a few things:
Git can only make commits to a local repo.
Git compresses files in its repo.
Git is ill-suited for versioning binaries.
See each section below for full explanation.
Git can only make commits to a local repo
There is no way I know of in Git to make a commit directly to a remote repository, without having to go through a local one first. That's not how Git works. If you want to make commits, I think you can only do so with a local repository.
Git compresses files in its repo
Files under the .git directory are compressed, so the Git repo at the .git directory will probably be much smaller than your working copy checkout, especially if it's just full of text and not binary files (more on binary files later). At work, I use a Git repo that's about 300 MB, but the working copy is around 2.5 GB, so the actual repo itself is much smaller in comparison.
Compression settings for Git
You can configure Git to use different compression levels:
core.compression
An integer -1..9, indicating a default compression level. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If set, this provides a default to other compression variables, such as core.loosecompression and pack.compression.
core.loosecompression
An integer -1..9, indicating the compression level for objects that are not in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to 1 (best speed).
pack.compression
An integer -1..9, indicating the compression level for objects in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to -1, the zlib default, which is "a default compromise between speed and compression (currently equivalent to level 6)."
Note that changing the compression level will not automatically recompress all existing objects. You can force recompression by passing the -F option to git-repack(1).
You can read more about Git packfiles from the free online Pro Git book.
Git is ill-suited for versioning binaries
Finally, the original poster makes this comment:
well...the app is dealing with media files... it's 500mb without git
Git is not well-suited to versioning binary files (like media files, like pictures, videos, audio clips, etc), because Git can't keep text diff deltas of changes to them like it can with text files, it actually has to keep each version of a binary in its entirety every time you make changes to it.
So if you have a 1 MB picture file called logo.jpg, and you make a small change to it, Git would have to store the whole logo.jpg file all over again, adding another 1 MB to your repository.
Solution 1: Remove binaries with git filter-branch
If your media files don't actually need to be versioned in Git, consider removing them from your repository using git filter-branch. You can read more about this option at the official Linux Kernel Git documentation for git filter-branch and at the "The Nuclear Option: filter-branch" of the free online Pro Git book.
Solution 2: use 3rd party services for media files instead
GitHub makes this suggestion for dealing with large media files:
Binary media files don't get along very well with Git. For these files it's usually best to use a service specifically designed for what you're using.
For large media files like video and music you should host the files yourself or using a service like Vimeo or Youtube.
For design files like PSDs and 3D models, a service like Dropbox usually works quite nicely. This is what GitHub's designers use to stay in sync; Only final image assets are committed into our repositories.
More about Git and versioning binary files
You can learn more about Git and versioning binaries in these other Stack Overflow questions:
Managing large binary files with git.
git with large files.

How do I ignore filemode changes in Git after conflict resolution?

I'm seriously about to stop coding and become a carpenter. This problem has had me stressed out for quite some time now and there doesn't seem to be any clear solution, other than forcing non-windows machines to use the file permissions windows seems to inflict.
Let's begin with the scenario. I have 2 development machines, one running Windows7, the other Mac OSX. They both are using Eclipse and EGit, and both have cloned the same project from a remote repo. That's where the similarities end, because my windows machine has a nasty habit of retaining a file mode of 644 (r-xr--r--) on its local repo, while the mode on the Mac defaults to a cool 775 (rwxrwxr--x).
So the problem's obviously the file permissions - GIT reports there are files that have changed due to differences in file modes and not actual content. The solution seemed obvious, run the following commands:
git config core.filemode false
git config --global core.filemode false
...which worked like a charm, except when committing and merging resolved conflicts.
For example, say files A, B and C exist on both the Windows and Mac repos. Next, let's change the file mode for these 3 files on the Mac machine so that the developer can edit them. Now change some of the contents in file A. Because we're ignoring the file modes (thanks to the commands above) only file A will be committed and pushed, ready for the Windows machine to fetch, merge and enjoy...
Now, let's edit file A again on the Mac and on the Windows machines, effectively causing a potential conflict, and let the Windows machine commit and push file A first. Then, when the Mac user commits their changes to file A and fetches the latest changes from the remote repo, a conflict is obviously created.
After resolving the conflict on the Mac machine and adding file A back to their local repo, committing that merge includes the previously ignored files B and C, and thus highlighting my problem! Why are the previously ignored files being included in this merge commit? This doesn't seem to be a Mac / Windows problem exclusively, as this problem can be recreated both ways...
This probably wouldn't matter if there were only 3 files, but this project I'm referring to includes thousands, and all these up and down push and pulls are insane. Am I missing something really obvious? Any help will be greatly appreciated!
So after a long and often frustrating run with trying to get Windows and Mac OS machines to like each other when comparing notes and changes over Git, it seems to just be the functionality of Git itself that's driven me mad, and a better understanding of how to better use Git seems to be the final answer.
Originally, I wanted to know how to continue ignoring file-mode changes (which Windows seemed to love updating with its own idea of what permissions and modes should be) while I was resolving conflicts created from updates to the files from someone else. Short answer: you can't. It's just the way Git works.
So how would you deal with this situation when it came up? Short answer again: use branching.
In my original way of using Git, I was continually working on my master branch on my local repo - bad idea already - so all my work would be committed there and all conflict resolution would need to be handled there too, forcing all files to be compared again and all permissions and file modes to come into question.
With branching, you work on another branch, commit to that branch, pull updates to your master branch, and merge you other branch with your master branch after that. Suddenly no more conflicts on the master from the remote repo, and you're winning!
Commands to do this (creating a branch from currently selected branch):
git branch newbranch
To checkout your new branch:
git checkout newbranch
To merge your new branch with your master branch (after you've committed to your newbranch, switch to the master first):
git checkout master
git merge newbranch
Hope this helps someone! -:)
well tbh, I'm not sure how you arrived at your conclusion since the obvious problem is git is not ignoring the file mode changes. This happens to us here too.
if you set that flag it seems to make no difference in some cases and still uses file modes to determine changed files.
that has to be a bug in git, what else could it be?
possibly your answer does not explain the rationale, therefore I dont think it's the correct answer, it's a workaround.
the correct answer is that git doesnt use filemode when it's told not to, it's obviously ignoring that and doing it anyway.
can you explain otherwise.

Resources