This is the umpteenth version of the extremely basic question "why the heck is Git telling me that files changed but diff shows no changes?". Similar questions have been posted here and here but none of those answers help.
My scenario is as follows:
I added a .gitattributes file to an existing Git repo with several already existing commits in it. The content of the .gitattributes file looks as follows:
* text=auto
*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf
*.sh text eol=lf
*.csproj text eol=crlf
*.filters text eol=crlf
*.props text eol=crlf
*.sqlproj text eol=crlf
*.sln text eol=crlf
*.vcxitems text eol=crlf
*.vcxproj text eol=crlf
*.cs text
*.config text
*.jmx text
*.json text
*.sql text
*.tt text
*.ttinclude text
*.wxi text
*.wxl text
*.wxs text
*.xaml text
*.xml text
*.bmp binary
*.gif binary
*.ico binary
*.jpg binary
*.pdf binary
*.png binary
After adding that file I executed the following commands:
git rm --cached -r .
git reset --hard
The result is that git status now shows most of the files in the repo as modified. However, I cannot see any changes in any of those files. The diff tool isn't showing any changes, neither in the text view nor in its hex view.
The repo has been created on a Windows machine and I'm currently using it on a Windows machine. The output of the command git config --list is as follows:
http.sslbackend=schannel
diff.astextplain.textconv=astextplain
credential.helper=manager-core
core.autocrlf=true
core.fscache=true
core.symlinks=false
core.editor="C:\\Program Files\\Notepad++\\notepad++.exe" -multiInst -notabbar -nosession -noPlugin
pull.rebase=false
credential.https://dev.azure.com.usehttppath=true
init.defaultbranch=master
user.name=My Name
user.email=my#email.whatever
core.autocrlf=true
core.eol=crlf
diff.tool=bc
difftool.bc.path=C:/Program Files/Beyond Compare 4/bcomp.exe
difftool.bc.cmd="C:/Program Files/Beyond Compare 4/bcomp.exe" "$LOCAL" "$REMOTE"
difftool.bc.prompt=false
merge.tool=bc
mergetool.bc.path=C:/Program Files/Beyond Compare 4/bcomp.exe
mergetool.bc.cmd="C:/Program Files/Beyond Compare 4/bcomp.exe" "$LOCAL" "$REMOTE" "$BASE" "$MERGED"
mergetool.bc.keepbackup=false
mergetool.bc.trustexitcode=true
core.repositoryformatversion=0
core.filemode=false
core.bare=false
core.logallrefupdates=true
core.symlinks=false
core.ignorecase=true
So the magic switches core.autocrlf and core.eol are as they should be for Windows, as far as I could decipher from the documentation.
Does anyone have a clue what Git landmine I've stepped on here?
There are multiple possibilities here, but the most common by far has to do with these CRLF line endings. It's complicated, and to really get it, we need some background first.
From a high level point of view, Git basically has two options:
Don't mess with line endings ever.
Do mess with line endings.
The first one is really simple, and is the default on all Unix-like systems. It's probably the default on Windows too, but I don't use Windows, so I'd have to defer to anyone else who says otherwise. In this setup, if you create a file and store, in that file, the byte-sequence:
h e l l o CTRL-M CTRL-J w o r l d CTRL-M CTRL-J
and then git add the file and run git commit, Git will store, in the repository, a new commit in which that file contains those 14 bytes. The blob hash ID will be:
$ printf 'blob 14\0hello\r\nworld\r\n' | shasum
23eb407b644b0e362fa224168ecd0adfa02b022a
This file has CRLF line endings. Extracting the commit will produce a file with CRLF line endings. The file in the repository is now read-only, frozen for all time; it has blob hash ID 23eb407b644b0e362fa224168ecd0adfa02b022a, as does every file in any Git repository anywhere in the universe, as long as that file contains exactly that text.
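The printf-into-shasum pipeline above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the default SHA-1 object format; the helper name git_blob_hash is made up for illustration and is not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 blob ID Git would assign to these bytes.

    Git hashes a small header ("blob <size>" plus a NUL byte)
    followed by the raw file content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


crlf = git_blob_hash(b"hello\r\nworld\r\n")  # the 14-byte file from the text
lf = git_blob_hash(b"hello\nworld\n")        # same text, LF-only: 12 bytes

print(crlf)
print(lf)
print(crlf == lf)  # False: different bytes, different blob
```

The point to take away is that the hash depends only on the bytes, so the CRLF and LF-only variants of "the same" text are two unrelated blobs as far as Git is concerned.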
Now suppose, having created this file (or not), we turn on the "do mess with line endings" options. We now get numerous sub-options, specifying just how Git will go about messing with line endings, when, on which files. These include eol=crlf, eol=lf, text, binary, and so on:
*.bat text eol=crlf
*.sh text eol=lf
*.jpg binary
This fragment tells Git that if the file's name ends with .bat, Git should mess with line endings in one particular way; if it ends with .sh, Git should mess with line endings in another particular way; and if it ends with .jpg, Git should not mess with line endings.
We know that the binary specification means that for such files, Git doesn't mess with line endings. This is good since, for instance, .jpg files do not actually have lines in the first place, so that anything that resembles a line ending is just coincidence. When Git isn't messing with anything, it's all easy: Git is storing what's there and showing you what's stored.
But that's no longer true for the other files. Since Git is now messing with their line endings, it becomes important to ask and answer more questions:
When exactly does Git mess with the line endings?
What exactly does Git do when it does this messing-about?
This is where things get complicated. The key to understanding things here is to know about Git's index. This thing—this "index"—is central in Git and you really do have to know about it to use Git properly, so let's take a tour of the index.
Git's index
Git's index is either so important or so poorly named (or both) that it actually has three names. It is also called the staging area, which refers to how you normally use it, and it is sometimes called the cache. This last name is pretty rare these days: you mostly see it in flags like git rm --cached. (Some commands, like git diff, have both --staged and --cached, with the same meaning. For some reason no one has gotten around to adding git rm --staged yet. I thought that would have happened by now, and I still think it will happen someday.)
The index does a bunch of things for Git, but here we really care about what it does for—and to—you. What it does for you is hold your proposed next commit. Git is, fundamentally, not about files, but rather about commits. Each commit holds files: in fact, each commit has a full snapshot of every file. (Each commit also has some metadata, such as the name and email address of the commit's author, but we'll skip that here.)
The thing about commits, though, is that they're purely read-only. You can make new ones, but you can never change any existing commit. The git commit --amend command, for instance, fakes it: it does not change the existing commit, it makes a new one and stops using the old one in favor of the new one instead. When you can't tell the difference—and sometimes you can't—this is just as good. When you can tell the difference—and sometimes you can—the cracks show through.
But if you can't change a commit (and you can't) and if, as is also true, the files inside a commit are in a special, compressed, de-duplicated, Git-only form that no programs other than Git itself can even read in the first place, how can you use the files that are inside a commit? The answer is simple enough: in order to use a commit, you have to have Git extract that commit first. We run git checkout or git switch to achieve this. Git extracts the files from the commit, placing usable versions of them in our working tree or work-tree, where we can see them and get our work done.
Git could stop here, with committed files—read-only inside the current commit, frozen for all time—and working files. Other version control systems do stop here. But Git doesn't. Instead, as it's extracting the commit, Git puts "copies" of each file into Git's index.
I put "copies" in quotes here because the files in Git's index are stored in the internal, compressed, de-duplicated format. Since they were just extracted from some commit, they take no space: they're de-duplicated away. They hold the same data in the index that they hold when they're inside the commit: this data is frozen for all time.
What's special about the index "copies" of files is that, unlike the committed copies, you can replace them. The git add command tells Git: compress and de-duplicate the working tree file. Git reads the working tree copy, compresses it, and checks to see if the compressed result is a duplicate of some existing file in any existing commit. (This is where that blob hash ID trick comes in: it's why any file consisting entirely of hello\r\nworld\r\n has hash ID 23eb407b644b0e362fa224168ecd0adfa02b022a.) If this is a duplicate, Git puts the duplicate's hash ID in the index. If it's not a duplicate, Git arranges to store a new blob in the object database,1 and stores the new blob's hash ID in the index.
Either way, after this update-the-index step, the proposed next commit is now updated. The file you git add-ed is now staged, and git status will compare the staged hash ID to the current-commit hash ID and say staged for commit if these hash IDs don't match. (This means that git add-ing a file that's been turned back to match the committed copy takes away the staged for commit message, even though the file will in fact be in the next commit. It's just that the hash IDs now match!)
So, Git's index holds this proposed next commit. To make a new commit, you:
futz with the files in your working tree;
run git add on them to copy them back into Git's index; and
run git commit to package up whatever is in Git's index right then.
This is why you have to keep git adding a file each time you change it: Git doesn't automatically copy the working tree file back into the index. Git only copies it back when you say to do that.2
The end effect—and what you should take into the next section—is that, at all times, Git has three copies of each file:
HEAD         index        work-tree
---------    ---------    ---------
README.md    README.md    README.md
img.jpg      img.jpg      img.jpg
main.py      main.py      main.py
for instance. The work-tree version is the one you can see, read, write, feed to a JPG viewer, run with the Python program, and so on. The other two are for Git: the HEAD version is the frozen-for-all-time copy from the current commit and the index version is the malleable-but-frozen-format copy, ready to go into the next commit.
The git checkout or git switch command switches to some commit, copying the files out of the commit to Git's index and then to your working tree.
The git restore command reads a file from somewhere—a commit or the index—and writes it to the index and/or your working tree based on the -S (write to staging) and -W (write to work-tree) options.
The git reset -- file command reads a file from the HEAD commit and writes it to Git's index, leaving your working tree copy alone. (The -- here is a precaution, in case the name of the file is, say, master or dev or something that resembles a branch name.)
The git add file command reads a file from your working tree and writes it to the index.
(Lots of alternatives are not listed here.)
So all these various commands are tricks for manipulating the index and/or working tree copy, in preparation for making the next commit (since Git is mostly about making new commits, while keeping all the old ones).
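To make the three-copies picture concrete, here is a toy model in Python. It is only a sketch: the ToyRepo class and its method names are hypothetical, it ignores compression and line-ending conversion entirely, and it tracks nothing but hash IDs, which is the part that matters for git status.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash content the way Git hashes a blob (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyRepo:
    """Hypothetical three-way model: HEAD, index, and work-tree."""

    def __init__(self, files: dict):
        self.head = {path: blob_id(data) for path, data in files.items()}
        self.index = dict(self.head)   # checkout fills the index...
        self.worktree = dict(files)    # ...and then the work-tree

    def add(self, path: str) -> None:
        """Copy the work-tree file into the index (as a hash ID)."""
        self.index[path] = blob_id(self.worktree[path])

    def commit(self) -> None:
        """Freeze whatever is in the index as the new HEAD snapshot."""
        self.head = dict(self.index)

    def staged(self) -> list:
        """Paths whose index hash differs from the HEAD hash."""
        return sorted(p for p in self.index
                      if self.index[p] != self.head.get(p))

    def unstaged(self) -> list:
        """Paths whose work-tree content differs from the index."""
        return sorted(p for p in self.worktree
                      if blob_id(self.worktree[p]) != self.index.get(p))


repo = ToyRepo({"main.py": b"print('hi')\n"})
repo.worktree["main.py"] = b"print('hello')\n"   # edit the usable copy
before_add = (repo.staged(), repo.unstaged())
repo.add("main.py")                              # copy it into the index
after_add = (repo.staged(), repo.unstaged())
repo.commit()                                    # freeze the index
after_commit = (repo.staged(), repo.unstaged())

print(before_add, after_add, after_commit)
```

The edit first shows up as an unstaged change, moves to staged after add, and disappears from both lists after commit, which mirrors the two comparisons git status makes.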
1. Git actually stores the new compressed blob object immediately, even if it winds up being replaced before you make a new commit. This is okay (if perhaps sub-optimal in certain peculiar situations) because Git will run git gc for you now and then. Certain older Git versions had a bug where git gc didn't get run often enough, and this could actually be a problem, but that's been fixed for years now.
2. Using git add -u tells Git to find modified working tree files, and add them, which automates the job. Using git commit -a is a lot like running git add -u && git commit: it runs a git add -u step before the commit. However, -a complicates things a bunch, and interacts badly with poorly-written pre-commit hooks, so it's kind of a bad idea. Try not to rely on it: use git add -u instead, in case you have one of these bad commit hooks. Or, learn to love the index, which lets you play clever tricks like git add -p, although this too interacts badly with poorly-written pre-commit hooks.
How and when Git messes with line endings
If:
Git is told to mess with line endings, and
a file is marked text, so that Git will mess with this file, or the text=auto setting is being used and Git guesses that this file is text
then:
Git will optionally mess with the file's bytes on the way from index to working tree (checkout or switch, restore, various kinds of reset, etc), and
Git will mess with the file's bytes on the way from working tree to index (add, mostly).
What messing-about will Git do? That depends on the eol= setting:
eol=crlf: On the way out, Git will change LF-only to CRLF. If a line reads hello\n in the index, Git will write hello\r\n to the working tree copy. On the way in, Git will change CRLF to LF-only. If a line reads hello\r\n in the working tree copy, Git will write hello\n to the index copy.
eol=lf: On the way out, Git will do nothing to the file. On the way in, Git will change CRLF to LF-only.
That's it—that's all Git will do! It won't ever change LF to CRLF on the way in, for instance. In that sense, we could say that Git "prefers" LF-only line endings. (If you want something fancier, you can write clean and smudge filters, which also operate on data "on the way in" and "on the way out" respectively, and here you can do whatever you like. But the built in stuff inside Git is limited to these few CRLF options.)
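The two rules can be written down as a pair of small Python functions. This is a simplified model of the behavior described above, not Git's own code: the function names are invented, and it assumes that on the way out only a lone LF gains a CR (so a line already ending in CRLF is left as it is).

```python
def to_index(worktree_bytes: bytes) -> bytes:
    """Way in (git add): CRLF becomes LF, whatever eol= says."""
    return worktree_bytes.replace(b"\r\n", b"\n")


def to_worktree(index_bytes: bytes, eol: str) -> bytes:
    """Way out (checkout and friends): only eol=crlf changes anything."""
    if eol == "crlf":
        # Assumption: collapse existing CRLF first so that only lone LFs
        # gain a CR and nothing turns into CR CR LF.
        return index_bytes.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    if eol == "lf":
        return index_bytes
    raise ValueError(f"unknown eol setting: {eol!r}")


print(to_worktree(b"hello\n", "crlf"))  # b'hello\r\n'
print(to_worktree(b"hello\n", "lf"))    # b'hello\n'
print(to_index(b"hello\r\n"))           # b'hello\n'
```

Note that there is no branch anywhere that produces CRLF on the way in, which is the sense in which Git "prefers" LF-only endings inside the repository.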
There's one more tricky bit: Git tries hard to avoid copying files between the index and the working tree when it thinks nothing has changed. This optimization usually works, but it fails (by not copying when it should) if you change whether or how Git should mess with line endings. The tricks you linked to, where you rm .git/index for instance, are mostly ways to get around this: they force Git to copy data even where Git thinks it doesn't need to, because the changed status of a file (from -text to text, or eol=lf to eol=crlf, or whatever) means that Git really does have to copy.
This is all that you need to memorize. The remaining details can be worked out.
Consequences
Suppose you have a repository in which, in every commit that has text files, all committed copies have LF-only line endings. Since this is, in effect, Git's "preferred" format, the files are already all "OK". If you choose to have Git mess with files, all future commits will have LF-only line endings too, and the future commits will match the existing commits.
But suppose you have a repository in which some or all text files are committed with CRLF line endings. These commits are frozen for all time! You literally cannot change them. They will continue to have CRLF line endings. If you now begin choosing to have Git mess with files, future commits will gradually, or suddenly all at once, have some or all files with LF-only line endings, as stored in the repository.
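This second case is the one that produces "modified, but the diff tool shows nothing". A rough Python sketch of what happens, assuming a file that was committed with CRLF endings and is now marked text; blob_id and clean are illustrative names, and the conversion is the simplified CRLF-to-LF rule from the previous section:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash content the way Git hashes a blob (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def clean(worktree_bytes: bytes) -> bytes:
    """Way-in conversion for a file marked text: CRLF becomes LF."""
    return worktree_bytes.replace(b"\r\n", b"\n")


committed = b"hello\r\nworld\r\n"  # blob frozen in HEAD, CRLF endings
worktree = committed               # the checked-out file has the same bytes

# A byte-for-byte comparison, such as a hex view, finds no difference.
bytes_identical = worktree == committed

# Git instead compares what it WOULD store now against what HEAD stores.
would_store = blob_id(clean(worktree))
in_head = blob_id(committed)
git_sees_change = would_store != in_head

print(bytes_identical, git_sees_change)  # True True
```

Both statements are true at once: the file on disk matches the committed bytes, and the blob Git would create from it on the next git add does not.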
Regardless of which of the above statements about the existing repository are true, your settings, should you set them, will affect how you see the files in your working tree, because to get into your working tree, Git has to extract the files from commits. But your file viewers might not show you what the ends of lines look like. That is, if your preferred file viewer displays a CRLF line and an LF-only line as identical, they'll look identical, even when they aren't.
Changing how lines end is a real change as far as Git is concerned, even when no visible text changes. If the existing commits in the repository have CRLF line endings, and you start having Git mess with line endings, it's a good idea to make one "normalizing" commit. You will become the owner of every line of every file that is changed this way, but git blame, at least, has a way to skip over a specific commit (the --ignore-rev option, in Git versions that have it) if you need to figure out where some code came from. Since this "fix all files, but no real changes" commit does nothing except normalize line endings, you can tell git blame to skip over it.
Note that Git (and git diff) do consider these lines different, unless you tell git diff to ignore certain white space changes:
--ignore-cr-at-eol: Ignore carriage-return at the end of line when doing a comparison.
-w, --ignore-all-space: Ignore whitespace when comparing lines.
(There are others; this is just a partial list.)
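As an illustration of what "ignore carriage-return at the end of line" means, here is a small Python comparison. It is a stand-in written for this answer, not how git diff is implemented, and both function names are made up.

```python
def lines_ignoring_cr_at_eol(data: bytes) -> list:
    """Split on LF and drop a single CR that ends each line."""
    return [line[:-1] if line.endswith(b"\r") else line
            for line in data.split(b"\n")]


def same_ignoring_cr_at_eol(old: bytes, new: bytes) -> bool:
    """True if the two files differ at most in CR-before-LF."""
    return lines_ignoring_cr_at_eol(old) == lines_ignoring_cr_at_eol(new)


print(same_ignoring_cr_at_eol(b"hello\r\nworld\r\n", b"hello\nworld\n"))  # True
print(b"hello\r\nworld\r\n" == b"hello\nworld\n")                         # False
```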
Other items that should be mentioned here
When Git commits a file, it stores both the file's data and its "mode". Git has two modes for files, which it calls 100644 and 100755 when it shows them, but for which git update-index has a --chmod option that it spells -x and +x respectively. This tells Git that on a Unix-like system or any other system that has an equivalent, the 100755 or +x file should be marked executable at checkout.
Most Windows file systems currently don't have an equivalent. In this case, Git tries to retain the chmod setting from the existing checkout. The rm .git/index trick defeats this "retain the old setting" trick. So it's possible to change the mode of files when fixing end-of-line issues. This is why it's better to use git add --renormalize after changing CRLF line endings settings, if your Git supports this.
The general idea that there are some changes, or features of files, that are invisible or hard to see is a little weird, but we have non-computing examples: for instance, in fine typesetting, we have the hyphen (-), the en-dash (–), and the em-dash (—). These may or may not display on your computer as different width dashes. We have other computer examples, such as the Whitespace programming language or the terrible mistake with makefile syntax (where tabs are significant). And, in spycraft—whether or not we use computers—we have steganography.
I am attempting to run 'git rm -rf --cached .' along with 'git add .' to remove cached files that are now listed in the .gitignore. I use Visual Studio on a Windows computer, and prefer to leave line endings just as they are for this particular situation.
I tried setting core.autocrlf to false using git config command. I tried creating a .gitattributes with the line '* -text', rm'ing the .git/index, and running git reset. So far, every time I add the files back, I get a huge list of modified files.
EDIT: The change in the files is not actually line endings, it is changes in file permissions which I did not request.
Edit: the remaining problem is that the file modes are apparently not stored properly in Windows systems (see also What is git's "filemode"?). To save and restore them, one will need a script, plus the original data:
git ls-files --stage > /tmp/original
To recover the modes, this rather crude pipeline should work:
< /tmp/original \
awk -F$'\t' '/^100755 / { print "git update-index --chmod=+x \"" $2 "\"" }' |
sh
This will attempt to chmod +x files that have been removed by the below sequence, so you can expect some error messages if there are any such files. (It also assumes no files have double quotes in their names.)
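If awk is not handy (on a plain Windows shell, say), the filtering half of that pipeline is easy to sketch in Python. This assumes the saved file holds ordinary git ls-files --stage output, one "mode hash stage TAB path" entry per line, with no specially quoted paths; the function name and the sample paths are invented for the example.

```python
def executable_paths(ls_files_stage_output: str) -> list:
    """Return the paths that were recorded with mode 100755."""
    paths = []
    for line in ls_files_stage_output.split("\n"):
        if not line:
            continue
        # Everything before the TAB is "<mode> <hash> <stage>".
        meta, _, path = line.partition("\t")
        mode = meta.split(" ", 1)[0]
        if mode == "100755":
            paths.append(path)
    return paths


# Made-up sample in the shape of `git ls-files --stage` output.
sample = (
    "100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tREADME.md\n"
    "100755 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tscripts/build.sh\n"
)
print(executable_paths(sample))  # ['scripts/build.sh']
```

Each returned path would then be handed to git update-index --chmod=+x, exactly as the awk version does.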
Assuming you do not already have a .gitattributes file, here is a six step process that should work:
Create that .gitattributes file just as you did
Run rm .git/index
Run git checkout HEAD -- .
Run git rm -r --cached .
Run git add .
Run git rm .gitattributes (you can leave this until after verifying that it all worked). Run git commit afterward.
I do not have (nor use) Windows so cannot test this, but here's the theory behind why it should work, and hence why there are these steps.
Git's actual data storage format is a special, Git-only, compressed (sometimes highly compressed) format. Files stored in this format are mainly useful only to Git itself. This format stores a raw, uninterpreted byte stream: files do not have to be separated into "text" and "data" and so on, they are just raw byte streams (hence treated as "data" / "non-text"). The data, once stored, are read-only and get assigned a hash ID (currently SHA-1 though a future Git may use SHA-256). Git calls a file stored this way a blob, which is a term stolen from the database world.
Your computer's useful-file-storage format is of course different, and may (and does on Windows) make a distinction between "text" and "data". Text may have encodings (such as ISO-8859-1, UTF-8, UTF-16, and so on). These files are generally both readable and writable and anything on your computer can deal with them (to some degree anyway, depending on encoding).
Git has to extract files from commits, turning them from blobs into files that you can work with. These files live in your work-tree. You work with them, and then git add them to give Git a chance to re-blob-ize them.
In between these special Git-only blobs and the work-tree, Git needs a place to store the blobbed data, that—unlike a commit—is writable, but that—like a commit—has the file in the special Git-only format. This "in between" place is Git's index. Various bits of Git documentation sometimes call this the staging area or the cache.
Git uses the index copy of each file (or blob, really) to make new commits. When you run git add, Git reads the work-tree file, encodes it down into the blob form, and saves it—well, its hash ID, really—in the index. When you run git commit, Git simply freezes the index copies into committed copies.
When you run git checkout to switch to some commit, Git extracts the commit into the index (filling in all the blob hash IDs), and also extracts the blobs into the work-tree so that they are in useful format and you can work on them. When you run git add, Git compresses the work-tree file into its blob format and replaces the index entry for the file.
Transforming a blob into a work-tree file, or vice versa, is the ideal place where Git will do any conversions you need, such as turning newlines into CRLF line endings. So that's where Git does it: git checkout fills the index and expands-and-converts into the work-tree, and git add compresses-and-un-converts from the work-tree into the index, ready for the next git commit. (Any files you don't touch, stay compressed and ready to go, safely tucked away in the index.)
You already know that a tracked file is one that is in the index, and an untracked file is one that is in the work-tree but not in the index. Your goal is to use the existing .gitignore to make files that are currently in the index go away from the index if they would be .gitignore-ed. The process you are using is:
git rm -r --cached .: remove everything from the index, so that the entire work-tree is untracked
git add .: produce all new blobs in the index from whatever is in the work-tree, while ignoring any file that is listed in .gitignore.
The issue here is that what's in the work-tree has been converted by the "blob to work-tree" conversions, and will be "un-converted" by the "work-tree to blob" conversions. Creating a .gitattributes file with * -text tells Git: "The conversions to do are no conversions at all."
Unfortunately, it's too late: the git checkout you ran earlier, to get this commit into the work-tree, already did some conversions.
So here, we use step 1 to create a .gitattributes file that says do no conversions. Step 2, rm .git/index, removes the index entirely. Git now has no idea what's actually in the work-tree. This step may be unnecessary but I use it to force Git to act in step 3, which tells Git: extract every file from the HEAD commit into the index and the work-tree. This re-creates the index, and re-fills the work-tree, this time doing no conversions.
Steps 4 and 5 are just as before, but this time, the work-tree files all match the blobs in the HEAD commit since step 3 operated with the .gitattributes directive in place. Step 6 is to make sure you do not commit the "do no conversions" directive.
I recently discovered that there are a couple folders in my solution that have two distinct paths in Git (GitHub shows two separate folders), one being FooBar and the other being Foobar. This is because some files were registered with the former folder name as their path, and some with the latter.
This was discovered locally (in Windows) by configuring Git to not ignore case: git config core.ignorecase false
I took a stab at fixing this by deleting the whole folder, committing, then re-adding the folder and committing again. This fixed the problem, but the files that got their paths changed lost their Git History. Running gitk against the new path for these files showed just the one commit. Running gitk against their old path revealed their whole history.
Next stab: Use git mv to move the file:
git mv Foobar/file.txt FooBar/file.txt
This yields the error:
fatal: destination exists, source=Foobar/file.txt, destination=FooBar/file.txt
And if I try deleting the file first, of course Git complains that the source file doesn't exist.
Then I discovered Git doesn't complain about the destination already existing if you add -f to the mv command. However, after committing that rename, gitk shows that the history got severed anyway!
I even attempted to do the three step dance described here but this was just another way of doing the -f. Same result.
Basically I just want to move a file from Foobar/file.txt to FooBar/file.txt in a case-insensitive operating system in some way, while preserving Git history. Is this possible?
There is no simple solution to the real problem.
In Git, files don't have history. Commits have history—or more precisely, commits are the history. That is all the history there is. For Git to "follow" a file, as in git log --follow <path>, Git looks at the commits, one at a time, comparing each commit to its parent commit.
If a diff between parent and child shows that the parent contains a file named parent/path/to/pfile and the child contains a file named child/path/to/cfile and the content of these two files, in these two commits, is "sufficiently similar" (several conditions must hold here), then, in Git's "eyes", that parent-to-child transition represents a rename of that file. So at that point, git log --follow, which had been looking for child/path/to/cfile, starts looking instead for parent/path/to/pfile.
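The "sufficiently similar" test can be pictured with a toy similarity check in Python. To be clear, this is not Git's algorithm: it swaps in difflib.SequenceMatcher from the standard library as a stand-in, the function name is invented, and the 0.5 threshold is only meant to echo the 50% default that git diff -M documents.

```python
from difflib import SequenceMatcher


def looks_like_rename(old_content: str, new_content: str,
                      threshold: float = 0.5) -> bool:
    """Stand-in similarity check; NOT Git's actual rename detection."""
    ratio = SequenceMatcher(None, old_content, new_content).ratio()
    return ratio >= threshold


parent_file = "def f():\n    return 1\n"   # content at parent/path/to/pfile
child_file = "def f():\n    return 2\n"    # content at child/path/to/cfile

print(looks_like_rename(parent_file, child_file))  # True
print(looks_like_rename("aaaa", "zzzz"))           # False
```

The important part is what the check does not look at: the path names themselves. Only the content decides whether two differently named files are paired up as a rename.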
Without --follow, git log does not do this special "find a rename" operation ... and in general, Git believes that any path names with any byte-level difference represent different files. In other words, case-folding and UTF-8 normalization do not occur. Consider, e.g., the word schön, which can be represented as either s c h ö n or s c h o combining-¨ n. We can, on a Linux box, create two different files using these two different UTF-8 style names. Running ls will show two files whose name appears the same:
$ cat umlaut.py
import os
p1 = u'sch\N{latin small letter o with diaeresis}n'
p2 = u'scho\N{combining diaeresis}n'
os.close(os.open(p1.encode('utf8'), os.O_CREAT, 0o666))
os.close(os.open(p2.encode('utf8'), os.O_CREAT, 0o666))
$ python umlaut.py
$ ls
schön schön umlaut.py
Git is perfectly happy to store both files, separately. However, macOS refuses to allow both files to coexist, in the same way that Windows (and for that matter, macOS by default as well) refuses to allow both Foobar and FooBar to coexist.
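The byte-level view that Git takes, versus the normalizing or case-folding view that some file systems take, can be shown directly in Python with the standard unicodedata module. This is just an illustration of the two comparisons, not a model of any particular file system.

```python
import unicodedata

composed = "sch\N{LATIN SMALL LETTER O WITH DIAERESIS}n"
decomposed = "scho\N{COMBINING DIAERESIS}n"

# Git's view: the UTF-8 byte sequences differ, so these are two names.
print(composed.encode("utf-8"))    # b'sch\xc3\xb6n'
print(decomposed.encode("utf-8"))  # b'scho\xcc\x88n'
print(composed == decomposed)      # False

# A normalizing view: after NFC normalization the names collide.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# A case-insensitive view: Foobar and FooBar collide as well.
print("Foobar".casefold() == "FooBar".casefold())  # True
```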
If you make Git store the file in new commits under the new byte sequence, history is preserved; it's just not the history you want preserved. But the history that's already in the repository is already not the history you want preserved.
In practice, you should probably just rename the file in Git's eyes—which has no effect on the file's name in your OS's eyes; FooBar and Foobar are the same name here—and get on with things. Your alternative is to rewrite all history going back in time to the point at which the bad pairings were first added to the repository, by copying (with slight modifications) each "bad" commit to a new-and-improved "good" commit. But this then means getting everyone who uses the repo to switch from "bad old repo" to "new and improved good repo".
I have an XML file that we consider binary in git. This file is externally modified and committed.
I don't care about who edited it and what's new in the file. I just want to have the latest file version at every pull. At this time, at every git pull I have a merge conflict.
I just want this file to be overwritten on every git pull, without manually doing stuff like git fetch/checkout/reset every time I have to sync my repo.
Careful: I want to overwrite just that file, not every file.
Thanks
I thought you could use Git Hooks, but I don't see one running before a pull...
A possible workaround would be to make a script to delete this file and chain with the needed git pull...
This answer shows how to always select the local version for conflicted merges on a specific file. However, midway through the answer, the author describes also how to always use the remote version.
Essentially, you have to use git attributes to specify a specific merge driver for that specific file, with:
echo binaryfile.xml merge=keepTheirs > dir/with/binary/file/.gitattributes
git config merge.keepTheirs.name "always keep their file during merge"
git config merge.keepTheirs.driver "keepTheirs.sh %O %A %B"
git add -A
git commit -m "commit file for git attributes"
and then create keepTheirs.sh in your $PATH:
#!/bin/sh
# $2 is the current version (%A), $3 is theirs (%B); Git reads the result from $2.
cp -f "$3" "$2"
exit 0
Please refer to that answer for a detailed explanation.
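On a machine without a POSIX shell, the same driver could be sketched in Python instead. This is an untested-in-Git illustration under a few assumptions: the script name keep_theirs.py is hypothetical, and the driver setting would have to be changed to something like "python keep_theirs.py %O %A %B" to match.

```python
import shutil
import sys


def keep_theirs(ancestor: str, current: str, other: str) -> int:
    """Overwrite the current (%A) file with the other side's (%B) version.

    Git reads the merge result from the %A path; returning 0 tells Git
    the merge finished cleanly. The ancestor (%O) is accepted but unused.
    """
    shutil.copyfile(other, current)
    return 0


# Git would invoke the script with exactly three paths: %O %A %B.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(keep_theirs(*sys.argv[1:4]))
```

As with the shell version, this silently discards the local side of that one file on every merge, so it only makes sense for a file whose local changes never matter.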
If the changes to your files are not actual changes, you should not submit them. This will clutter your version history and cause numerous problems.
From your statement I’m not quite sure which is the case, but there are 2 possibilities:
The file in question is a local storage file, the contents of which are not relevant to your actual source code. In this case the file should be part of your .gitignore.
This file is actually part of your source and will thus have relevant changes in the future. By setting up the merge settings the way you are planning to, you will cause trouble once this file actually changes, because merges will then be destructive.
In this case the solution is a little bit more complicated (apart from getting a fix for the crappy tool that changes stuff it doesn’t actually change …). What you are probably looking for is the assume unchanged functionality of git. You can access it with this command:
git update-index --assume-unchanged <file>
git docu (git help update-index):
You can set "assume unchanged" bit to
paths you have not changed to cause git not to do this check. Note that setting this bit on a path does not mean git will check the
contents of the file to see if it has changed — it makes git to omit any checking and assume it has not changed. When you make changes
to working tree files, you have to explicitly tell git about it by dropping "assume unchanged" bit, either before or after you modify
them.