Does git checkout update all files? - git-checkout

Newb question, I want to make sure I understand this.
When I git checkout <revision>, does this return the entire project to its state at that moment, or does it only recreate the files changed in that particular revision?
For example: If my folder was completely empty besides the .git repo, and I git checkout master, will the resulting files be the project in its entirety, or only the files changed in the most recent commit?
I ask, because I am checking out my project at various points (starting from the beginning), and instead of the project slowly growing in size as one would expect, the size of each checkout is varying quite a lot.

When I git checkout <revision>, does this return the entire project to its state at that moment, or does it only recreate the files changed in that particular revision?
If your working tree and staging area are completely empty (besides the .git subdirectory, of course) and you run
git checkout <revision>
then your working tree and staging area will perfectly reflect the contents of that particular revision.
On the other hand, if your working tree is not empty when you run git checkout, what happens is much more subtle, and may be broken down into three cases:
The checkout is not problematic and Git carries it out without batting an eyelid: the contents of that particular revision get copied to your working tree (and overwrite stuff already present there, if needed). Or
The checkout, if it were carried out, would result in a loss of local changes; therefore, Git (under the assumption that you didn't use the -f flag) tells you off and aborts the checkout. Or
A more complicated situation may arise in which stuff is only partially checked out, and some local, uncommitted changes are kept in your working tree and/or index. More details about that situation can be found in my answer to Why are unstaged changes still present after checking out a different branch?.
[...] the size of each checkout is varying quite a lot.
Are you taking into account untracked files? Did you commit, then later remove large files? On the basis of the information given in your question alone, we can do little more than hypothesize about the reason why the size varies a lot.

From the documentation: "Updates files in the working tree to match the version in the index or the specified tree." In the case of your example, it will return the working tree, in its entirety, to the state recorded in the revision you check out.
However as Jubobs pointed out there is a difference in behaviour if you have made any changes to the state of your repository since your last checkout. His answer is more comprehensive than mine if this is the case.
Also note that this will only apply to files that are tracked by git, so any other files you have lying around will not be affected.
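To see both points in practice, here is a minimal sketch run in a throwaway repository. The file names, the identity settings and the temporary directory are made up for the illustration; the second commit only touches b.txt, yet checking out master brings back every tracked file, while the untracked file is left alone.

```shell
#!/bin/sh
set -e
# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example"
git config user.email "example@example.com"

echo one > a.txt
git add a.txt
git commit -q -m "add a.txt"
first=$(git rev-parse HEAD)

echo two > b.txt
git add b.txt
git commit -q -m "add b.txt"      # this commit only touches b.txt

echo scratch > untracked.txt      # never added, so Git does not track it

git checkout -q "$first"          # detached HEAD at the first commit
at_first=$(echo *)

git checkout -q master            # back to the latest commit
at_master=$(echo *)

echo "first commit: $at_first"
echo "master:       $at_master"
```

The first listing should lack b.txt and the second should contain both tracked files; untracked.txt should appear in both because Git never manages it.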

Related

What is the optimal way to only sync certain file extensions and exclude other file extensions between separate git branches?

Given 3 branches say master, b1, and b2.
The master branch only cares about *.txt files. It needs to ignore everything else.
Branch b1 only needs what is included by master and say *.h, *.c, *.cpp files, and ignore everything else.
Branch b2 also only needs to include the ones included by master and say *.jpg, *.png, *.html, *.css etc. ignoring everything else.
In short, master branch contains information only common to all branches. Example use case: Branch b1 is used to generate output files to be consumed by branch b2, but both contain some information shared with master.
So, what is the optimal way to sync only those common files between master, b1, and b2, and have each branch to only include certain file extensions in its branch and ignore everything else it doesn't need?
I also looked at alternatives of having separate git repos or submodules or subtrees, but the directory structures or nesting patterns created little difficulties. Is there a better way to solve this problem?
Let me start with this because it's perhaps more useful:
Is there a better way to solve this problem?
You could, in theory, not bother with a master branch at all. Have three branches, none of which holds final-assembled-results. Do the assembly outside of Git. If desired, make an "orphan branch" (or use a tag) to record the assembled result, or just keep the assembled results in a completely different repository. But these all result in those sorts of little difficulties you mention.
What goes wrong
Git simply doesn't work the way you want: you cannot (usefully anyway) "care about" some files and have other "uncared-about" files with the "caring" switching around based on the branch. That's because "branches", in the sense you're using the word, do not exist.
Now, that's a strong statement and it needs justification. Clearly, branches do exist. The problem lies in the meaning of the word branch. It has too many meanings and people just sort of flip between them, without realizing that they're doing this, and that gets them into trouble. (See also What exactly do we mean by "branch"?) So let's just avoid the term by using what Git really uses: commit hash IDs.
When you run:
git checkout br2
you're telling Git to do two things:
save the name br2 for future use;
turn the name br2 into a commit hash ID, and extract—which includes "caring about"—a snapshot of all files from that commit.
The second step is the one that really matters right now: the first one is only needed later, when you run git commit to make a new commit, or some other Git command that needs the name (git branch or git status or git rebase, for instance).
With one exception—which you see in a fresh clone that hasn't yet run git checkout—Git always has some commit checked out right now. Your git checkout tells Git: sweep away the one we have right now, and get me some other commit as the checked-out commit.
Let's say that right now, you have br1 checked out, which is commit b100 right now. Later, the name br1 may mean some other commit, but right now it means that one. You run git checkout br2, which tells Git to switch from commit b100 to commit b200 as that's the one that the name br2 means right now.1
OK, no big deal yet, right? We're moving from commit b100 to commit b200. Commit b100 has in it the *.h files and omits the *.jpg files entirely. So Git "cares about" the *.h files while we have b100 out. Those files are tracked, which means they're in the (single) index. We're moving off b100 though, to b200, which has the *.jpg files and omits the *.h files. Git has to copy the *.jpg files into its index and remove the *.h files from its index, which means it has to remove the *.h files from your work-tree too.
So far, this is all going great: you get just what you want. But now you want to get to master and assemble the pieces. The name master means some other commit, maybe a123 at the moment.
No matter how you get to master, from br1 (b100 at the moment) or br2 (b200 at the moment), you don't have all the *.h and *.jpg files. You can only get one set or another. The underlying problem here is that the "caring about" happens because the files are in Git's index. Listing files in a .gitignore file, which is what you do to keep them from getting into Git's index, only helps if they're not already there. When you switch to a commit that has the files, Git will put them into Git's index, regardless of what's in a .gitignore file. When you switch to a commit that omits the files, Git will remove them from Git's index, regardless of what's in a .gitignore file.
The index's contents reflect the commit you check out. Each commit has a full snapshot of every file that's in that commit. That snapshot winds up in Git's index. Unless you change them—with git add, or git rm, or by doing another git checkout that replaces them wholesale, for instance—those are the files that will go into the next commit.
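A small experiment can make this concrete. The sketch below is only an illustration under assumed names (common.txt, util.h, logo.jpg and the temporary directory are invented): it builds br1 and br2 with mutually exclusive files, drops in an uncommitted .gitignore that lists *.h, and then prints what the index holds after each checkout.

```shell
#!/bin/sh
set -e
# Throwaway repository; file and branch contents are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo notes > common.txt
git add common.txt
git commit -q -m "common files"

git checkout -q -b br1
echo '#define X 1' > util.h
git add util.h
git commit -q -m "headers"

git checkout -q -b br2 master     # start br2 from master, not from br1
echo fake > logo.jpg
git add logo.jpg
git commit -q -m "images"

# An ignore rule added afterwards; the file itself is never committed.
echo '*.h' > .gitignore

git checkout -q br1
on_br1=$(git ls-files | xargs)    # index contents while br1 is checked out
git checkout -q br2
on_br2=$(git ls-files | xargs)    # index contents while br2 is checked out

echo "br1 index: $on_br1"
echo "br2 index: $on_br2"
```

util.h comes back into the index on br1 even though the ignore rule matches it, and it disappears again on br2: the commit decides, not .gitignore.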
Last, when you use git merge to combine work, Git:
finds a merge base commit;
compares the two branch tip commits against this merge base; and
uses that to figure out what to put into the new commit.
The new commit, like any commit, has a snapshot of all the files: all the files that were in Git's index at the time git merge made the merge commit, and those files are the result of the combining process above. Merge commits are the same as any other commit: they have a snapshot and metadata. The only thing that makes them special—makes them merge commits—is that they have two (or more) parent commit hash IDs listed in their metadata.
These interlocking behaviors get in the way. Either master actually does have all the files, in which case the commits found by the other branch names also need to have all the files; or master doesn't have any of the files, in which case the other branches can be exclusive like this, but you can't merge them back into master. The common commit that Git finds as the merge base will cause the merge to add the files to the new commit that goes into master, and now master has all the files! If you remove them as you go back into the branches, merging will remove the files this time.
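The effect of merging a branch back can be checked in a scratch repository. This is a sketch with invented file names, not a recipe: master starts with only common.txt, br1 adds util.h, and the merge commit on master ends up carrying both files and two parents.

```shell
#!/bin/sh
set -e
# Throwaway repository; names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo notes > common.txt
git add common.txt
git commit -q -m "common files"

git checkout -q -b br1
echo '#define X 1' > util.h
git add util.h
git commit -q -m "headers"

git checkout -q master
before=$(git ls-files | xargs)            # what master tracks before the merge

git merge -q --no-ff -m "merge br1" br1   # force a real merge commit
after=$(git ls-files | xargs)             # what master tracks after the merge

# A merge commit lists two (or more) parents in its metadata.
parents=$(git rev-list --parents -n 1 HEAD | awk '{ print NF - 1 }')

echo "before: $before"
echo "after:  $after"
echo "parents of the merge commit: $parents"
```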
Ultimately, Git is all about commits. It's the commits that determine, well, everything! The commits are snapshots-plus-metadata. All a branch name does is find one particular commit: the last one on some chain. Commits can be reached from more than one branch name, and many, or most, commits are on multiple branches simultaneously. So the name has nothing to do with which files are in the commit: it literally can't when more than one name finds that commit.
1Branch name to commit hash ID mappings change, which is how branches grow in Git. Git is built to add new commits, so the normal way that a name changes is that it now means a newer commit that leads, via the commit graph, back to the old commit—and many more commits too. See also Think Like (a) Git.

Getting error setting up a git repo on a shared drive

I would like to set up a git repository for several personal projects. I guess I'll do one repo per project; the projects are for different programs and libraries, not different parts of one program or system.
I'm on Windows 10 Pro, with a network drive I call Y. I'll put these repos in a directory cleverly named git. I found instructions to do this from the DOS command line:
Y:
cd \git\
git init --bare myrepo.git
Then, from the place where the code is I am going to want to put in the repository:
C:
cd \files\programming\workspaces\project1
git remote add origin y:\git\myrepo.git
When I execute the last line, I get the message:
fatal: not a git repository (or any of the parent directories): .git
I have tried it with slashes and backslashes; I have deleted the directory I created with the git command and done it over (in case I'd forgotten something), but still get the same result. What am I doing wrong?
I'll put this in as a separate answer, in terms of setting up a second (and --bare) repository. You've already done this on Y:. You mention that this is a shared / network-accessible drive; that's sometimes OK, depending on how things work with lock files across shared drives. There are OS-specific, file-system-specific, and network-specific details to be considered. Let's just assume that they're OK, and/or that you and co-workers will be careful about who writes to the Y: drive. Anyway, having done:
cd /d Y:\git\
git init --bare myrepo.git
you now have a second repository (made first, but the order of creation doesn't really matter). Meanwhile you now do:
cd /d C:\files\programming\workspaces\project1
git init
which creates .git within C:\files\programming\workspaces\project1. You can now add a remote, which is just a name for another Git repository in which you store the URL of the other Git repository:
git remote add origin file://Y:/git/myrepo.git
Now that there is a .git (repository) in C:\files\programming\workspaces\project1, this git remote add should work. The name origin is simply conventional: you can use any name you like, but origin is the one that git clone creates when you use git clone to make the local (C: area) repository.
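The same sequence can be reproduced in a POSIX shell. The sketch below is an assumption-laden stand-in: temporary directories replace Y:\git and the C: project directory, and it assumes the temporary directory is not itself inside some other Git repository. It shows the remote being rejected before git init and accepted afterwards.

```shell
#!/bin/sh
set -e
# Temporary stand-ins for the two locations; the paths are hypothetical.
base=$(mktemp -d)
git init -q --bare "$base/myrepo.git"     # plays the role of Y:\git\myrepo.git

mkdir "$base/project1"                    # plays the role of the C: project directory
cd "$base/project1"

# No .git here yet, so adding a remote fails with "not a git repository".
if git remote add origin "$base/myrepo.git" 2>/dev/null; then
    before=ok
else
    before=failed
fi

git init -q                               # create the local repository first
git remote add origin "$base/myrepo.git"  # now there is a config to store the URL in
url=$(git config --get remote.origin.url)

echo "before git init: $before"
echo "origin URL:      $url"
```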
Git itself does not mind if the URL has backslashes, but some other software might, so it's probably wisest to stick with the normal slashes here.
You're now ready to have your local repository—the one in C:\files\programming\workspaces\project1\.git—call up a second Git over at file://Y:/.... But there's no point connecting either of these two repositories until at least one of them has some commits, because commits are the main currency of exchange between two Git repositories. So, any time before or after the git remote add, you'll want to make at least one initial commit, which will cause the branch name master to spring into existence.
Once you have a commit, or some commits, you can run:
git push origin master
The git push command connects your Git—the repository you're in right now—to some other Git. The name origin here is the name of the remote, as set up by git clone normally, but by your git remote add in this particular case. The last argument, master, is what Git calls a refspec.
Refspecs are a bit complicated. They have two parts, separated by a colon : character. This particular refspec, master, is missing the colon. For git push, what that means is: use the same name on both sides. So this is really shorthand for:
git push origin master:master
The name on the left of the colon is usually a branch name in your repository. The name on the right side of the colon is the corresponding branch name in their repository—the other Git.
Your Git will now invoke another Git, often over a network connection (ssh:// or git:// or http:// or https:// URLs). In this case, your Git works with the second Git repository directly, but you can just think of it as spinning off a second Git command that acts as the receiver for the other repository (because it actually does that).
Your Git now offers commits to that other Git, by their hash IDs. They inspect the hash ID and check in their database of commit-and-other-Git-objects to see if they have those hash IDs. Your Git wants to know: should I give you, other Git, this one? Or do you already have it? At this point, their database is empty, so the answer is always I don't have that one - gimme! Your Git now packages these up, compressing them for network transfer, and sends them over. They install these commits in their database.
At the end of this process, your Git sends their Git a polite request of the form: Please, if it's OK, set your master now to point to the same commit that my master points to. If they accept (and here, they will), they now have their master pointing to the last commit in the chain of commits that ends with the commit that is in your repository, that your Git finds via your name master.
If this sounds complicated, it's because it is: you and they don't share branch names at all, unless you and they want to. But you and they do share commit objects with their unique hash IDs.
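To watch the push described above happen end to end, one can try it in scratch directories. The sketch uses invented paths and a single hypothetical README.txt; it pushes with the explicit master:master refspec and then compares the hash ID stored under master on each side.

```shell
#!/bin/sh
set -e
# Temporary stand-ins for the shared and local repositories; paths are hypothetical.
base=$(mktemp -d)
git init -q --bare "$base/myrepo.git"

git init -q "$base/project1"
cd "$base/project1"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo hello > README.txt
git add README.txt
git commit -q -m "initial commit"         # master springs into existence here

git remote add origin "$base/myrepo.git"
git push -q origin master:master          # left: our name, right: their name

mine=$(git rev-parse master)
theirs=$(git --git-dir="$base/myrepo.git" rev-parse master)

echo "my master:    $mine"
echo "their master: $theirs"
```

Both lines should print the same hash ID: the commit object was copied over, and their branch name was set to point at it.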
Fetching (and cloning) is not quite symmetric, but I'm out of time now for this answer.
Apparently C:\files\programming\workspaces\project1 is not a Git repository.
It's true that Y:\git\myrepo.git is a Git repository (assuming the earlier git init worked). But C:\files\programming\workspaces\project1 is not. You'll need to create a second Git repository there.1 You could git clone the empty repository over in Y:\git\myrepo.git, for instance (although cloning a totally empty repository has some weird side effects and is usually not the right way to start).
The way Git works in general is:
You clone an entire existing repository: every commit, which saves every version of every file, is now copied into the .git directory here.
Git is really all about these commits. Each commit has a full and complete snapshot of every file, frozen into a special, read-only, Git-only format, along with some additional information such as who made the commit, when, and why. These commits act as archives: every time someone ran git commit, Git archived everything.2 You now have a copy of everything, in this special archival format that only Git can use.
Each commit has its own unique hash ID. No two commits can ever share an ID. For this reason, and the fact that every Git in the world has to give every commit a unique hash ID, these IDs have to be very big and ugly and random-looking. All Gits everywhere share hash IDs for all their commits, so that they can share their commits by ID, later. You can connect any two Gits to each other and they can obtain each other's commits just by using these hash IDs.
Because these hash IDs are so big and ugly, humans can't get them right. Fortunately we don't have to; we'll see this in a bit.
Then—this is actually built in, as the last step of the git clone you just did—you have Git select some commit. That selected commit becomes the current commit.
You usually select the commit by selecting a branch name, in which case that name becomes the current branch name as well, i.e., these two actions are paired up: selecting branch master as the current branch chooses the last commit in that branch as the current commit.
(There's an extra complication here when doing that initial git clone, again, but we'll skip it for now.)
Git now extracts all the files from the chosen commit, into a work area. In this work area, you have ordinary files that you can use with all the ordinary programs on your computer. These are your files, to work with as you wish: they're not Git's copies at all. Git just extracted everything from a commit, in order to make these copies available.
Once you have all the working copies of files stored in this work area, which we call a working tree or work-tree, you can work with them. That's why it's your work-tree: because you can actually get some work done.
Having worked on a bunch of files, you might want to save a new saved-for-all-time archival snapshot. You might think you could just run git commit and Git would save all your files. Other version control systems work this way, but Git does not. Git has, in a secret file,3 saved away all the files that came out of the current commit. Those files are in the special Git-only frozen format, but unlike the copies of the files that are in commits, they're not actually frozen. You can replace them, or remove some file(s) entirely, or put new ones in.
Git calls this special extra area the index, or the staging area, or sometimes—rarely these days—the cache. These three names for this one thing reflect its central and multiple roles in Git, or perhaps there are three names because the original name, "index", is just so terrible. :-) But either way, you need to know about the index.
Essentially—and leaving out some of its other roles—what the index is and does is represent the next commit you will make. That is, it contains, in a special Git-specific format, some information about your current work-tree, but more importantly, a copy of each file that will go into the next commit.4
Having updated your work-tree files, you need to copy those files back into Git's index, which you do with git add:
git add file1 folder/file2
for instance.5 This copies these two files from your work-tree into the index, turning the copies into the special Git-only format, ready to go into the next commit. In other words, these files are now staged for commit, hence the other name of the index, "staging area". They're not actually committed yet but they are ready to go. (They were ready to go before, too, but before, they matched the current commit's copy!)
At this point, running git commit makes a new commit from whatever files are in the index right now. This new commit gets your name and email address as both author and committer, and "now"—the current date-and-time reported by your OS—as the time-stamps. You should supply a log message giving the reason you made the commit: a summary of why you are doing whatever you are doing.6
The git commit command packages up all of this information—who, when, why, and so on—along with the raw hash ID of the current commit, and makes a new commit out of this plus the snapshot it makes using the files that are already in the right format in Git's index. Now that the new commit is made, it becomes the current commit. Now things are back to the way they were when you ran git checkout: the index and the current commit both contain the same set of files, in the frozen archive format, ready to go into a new commit.
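The role of the index in this cycle can be observed directly. Here is a minimal sketch, with a made-up file1 in a temporary repository: editing the work-tree file alone stages nothing, git add copies it into the index, and git commit then snapshots what the index holds.

```shell
#!/bin/sh
set -e
# Throwaway repository; file1 and its contents are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo v1 > file1
git add file1
git commit -q -m "v1"

echo v2 > file1                                   # change only the work-tree copy
staged_before=$(git diff --cached --name-only)    # index still matches the commit

git add file1                                     # copy the new content into the index
staged_after=$(git diff --cached --name-only)     # index now differs from the commit

git commit -q -m "v2"                             # snapshot whatever is in the index
committed=$(git show HEAD:file1)

echo "staged before add: '$staged_before'"
echo "staged after add:  '$staged_after'"
echo "committed content: $committed"
```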
Note that no existing commit changes during all of this. In fact, no existing Git commit can ever change. All commits are frozen for all time, read-only. They continue to exist as long as you and Git can find them—usually forever, but you can arrange to "lose" one, if you've made one that you don't like.
The way Git finds commits is important, and a little tricky. Once you get the hang of it, though, it's actually really simple.
1This may actually be the only repository you need: it's not clear why you wanted an empty and bare one in Y:\git\myrepo.git in the first place.
2More precisely, Git archived everything it was told to archive, as we'll see in a moment.
3It isn't really secret at all, but you can't see it very well: it's hidden in a specially-formatted file in .git named index (and maybe other places too, but they all start from the index file; the index file contains records, and some of them might list more files).
4Technically, what's in the index, in these cache entries, is the file's path-name, mode, and an internal Git blob hash ID. There's also a staging slot number which is really only used for merging. The hash ID means that rather than holding an actual copy of each file, the index just holds the record of the Git-formatted blob object. But unless you start using git ls-files --stage and git update-index directly, you don't really need to know about this: you can just think of the index as holding a copy of each file.
5You can use either forward slash like this, or backslash; both work. I don't use Windows, and always use forward slash, and the few times I have been forced to use Windows briefly, I always name my files there with forward slashes. (This mostly works, except for a few commands that insist on thinking they're switch options. When dealing with Git and its ecosystem, backslash tends to confuse some other programs: \b, for instance, may represent a backspace, and \t a tab, so an attempt to name a file .\buttons\toggle can misfire and you end up with a file named .^Huttons^Toggle or something.)
6Git can easily show what you did, later, but Git has no idea that this was, e.g., to fix bug#12345 or issue#97 or whatever it might be, much less how the bug or issue could be described. This log message is your opportunity to explain things like what the bug is, where to find it in the bug reporting system, what you discovered during investigation of the bug, and anything else that might be helpful to you, or someone else, looking at this commit later.
Branch names let Git find commits for you
A branch name like master, in Git, really just holds one hash ID.
That's all it needs to do. We mentioned before that whenever you have Git make a new commit, the new commit saves the raw hash ID of the current commit.
Suppose you have an existing Git repository with just one commit in it. This one commit has some big ugly hash ID, but we'll just call it A for short:
A
There's only the one commit in the repository. That one commit has however many files, but it's just one commit. It's easy to find: it's the commit. Let's add a second commit now, by having this commit checked out via the name master—we'll put the name in, in just a moment.
We modify some work-tree files, git add them, and run git commit and give it a reason for the commit to put in the log message. Git builds a new commit out of all of the files in the index, plus the usual metadata, including the hash ID of commit A. Let's call the new commit B, and draw it now:
A <-B
B contains the old commit's hash ID. We say that B points to A.
Git writes the new commit's hash ID into the name master, so let's draw the name master pointing to B now:
A--B <-- master
I've already gotten lazy here (on purpose): it's B that points to A, not vice versa. But the arrow coming out of B cannot change, because no part of any commit can change. It's the arrow coming out of master that changes. We call commit A the parent of B, and B a child of A.
The current branch is now master and the current commit is B. Let's make a new commit in the usual way:
A--B--C <-- master
New commit C points back to B, which points back to A. So B may be a child of A, but it's also the parent of C.
(Where does A point? The answer is: nowhere. Commit A is a little bit special. Being the very first commit, it can't point back to any earlier commit. So it just doesn't. Creating the first commit in a repository is a bit of a special act; it's what creates the branch name, too! The name master is not allowed to exist until some commit exists, so creating commit A creates everything.)
(I keep saying a child, not the child. That's because we can go back and add more children later. Commits, once made, are frozen for all time, so the children know exactly who their parents are, but parents can acquire new children, someday, in the future. When a new commit is made, it never has any children yet. So parents never know who their children are. That's why Git works backwards!)
Note how all we need is for Git to hold the raw (and random-looking) hash ID of the last commit in the name master. We can remember the name master, and Git remembers the hash ID for us. Adding a new commit consists of:
making sure the current branch name is master (git checkout master if needed)
so that the current commit is C
so that Git's index is full of the right copies of files, and our work-tree has the files we want
so that we can change work-tree files in place using all of the normal computer tools
so that we can git add the updated files to make Git copy them back into the index
so that we can git commit to make a new commit D
which will change our picture to read:
A--B--C--D <-- master
All of these new commits go into our repository. The repository itself is mainly just two big databases:
the commits, and other internal Git objects, addressed by hash IDs;
and a smaller name-to-hash-ID table, that says things like branch name master means commit a123456... or whatever.
The entire repository is in the .git directory / folder, underneath the top level of our work-tree. The branch name(s) find the last commits, and those commits find earlier commits. Git simply walks backwards, from last commit back to first one, one commit at a time. Git knows that it has run out of commits to walk backwards through when it reaches a root commit like commit A, that has no parent.
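The backward walk is easy to inspect. The following sketch, run in a temporary repository with three empty commits standing in for A, B and C, asks Git for the hash ID behind master, for the parent recorded in that commit, and for the root commit that has no parent.

```shell
#!/bin/sh
set -e
# Throwaway repository; the commit messages A, B, C mirror the drawing above.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "A"
git commit -q --allow-empty -m "B"
git commit -q --allow-empty -m "C"

tip=$(git rev-parse master)                      # the one hash ID the name holds
parent_of_tip=$(git log -1 --format=%P master)   # parent stored inside commit C
second=$(git rev-parse master~1)                 # commit B, found by walking back

root=$(git rev-list --max-parents=0 master)      # commit A, the root commit
root_parents=$(git log -1 --format=%P "$root")   # empty: A points nowhere

count=$(git rev-list --count master)             # commits reachable from master

echo "master -> $tip"
echo "C's parent is B: $parent_of_tip"
echo "commits reachable from master: $count"
```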
There is a lot more to it than this, starting with the fact that you can add more branch names:
A--B--C--D <-- master, dev
for instance, and move branch names around, and so on—and we haven't even touched on the idea of connecting this Git repository, in C:\files\programming\workspaces\project1, to another Git repository in Y:\git\myrepo.git or on another machine or whatever yet. That's where things get complicated. That's what git remote is for: a remote is a name you use in your Git to remember the URL for some other Git repository.
If you don't need to use remotes yet, don't do that; this is plenty to start with.
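As a small illustration of two names finding the same commit, the sketch below (temporary repository, hypothetical commit messages) creates dev next to master, checks that both resolve to one hash ID, and then shows that committing on dev moves only dev.

```shell
#!/bin/sh
set -e
# Throwaway repository; names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "A"
git branch dev                            # a second name for the same commit

if [ "$(git rev-parse master)" = "$(git rev-parse dev)" ]; then
    same=yes
else
    same=no
fi

git checkout -q dev
git commit -q --allow-empty -m "B"        # only the current branch name moves

master_tip=$(git rev-parse master)
dev_tip=$(git rev-parse dev)
dev_parent=$(git rev-parse dev~1)

echo "same commit at first: $same"
echo "master: $master_tip"
echo "dev:    $dev_tip"
```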

List specific git branches depending on a filter

I am working on a huge repo with a lot of branches in it. Most of them are already merged to master and some of them are waiting to be merged.
I need to see which of them touch a specific directory.
For example, in this repo there are several APIs, let's say A, B, C, etc.
I would like to see which of the branches of this repo are trying to change A. That's why I mentioned a 'specific directory' above, but if there is an easier way to check, it's also acceptable.
I am not sure which way would be most convenient. Should I write a script, or are there native git commands for this?
(This is not a complete answer. Just trying to bring some elements to the conversation.)
For a given branch, it would be quite trivial with something like
git rev-list --count <referenceBranch>..<featureBranch> -- path/to/directory/
which would output a positive value only if the branch has commits, not yet reachable from the reference branch, that touch the given directory.
Now to loop over branches and execute the above test, I first thought about for-each-ref which, as the name indicates, cycles through each ref in a given ref set (here it would be refs/heads), but I don't yet see how to make these work together. Probably just a loop in bash?
Interesting question indeed.
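One possible way to combine the two is a plain shell loop. The sketch below is an untuned illustration: it builds a scratch repository with hypothetical directories A and B and branches feature-a and feature-b, then feeds the output of git for-each-ref into the git rev-list --count test, using master as the reference branch.

```shell
#!/bin/sh
set -e
# Throwaway repository; directory and branch names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

mkdir A B
echo a > A/api.txt
echo b > B/api.txt
git add A B
git commit -q -m "base"

git checkout -q -b feature-a
echo change >> A/api.txt
git commit -q -am "touch A"

git checkout -q -b feature-b master
echo change >> B/api.txt
git commit -q -am "touch B"
git checkout -q master

# Collect every local branch that has commits touching A which master lacks.
touching=""
for branch in $(git for-each-ref --format='%(refname:short)' refs/heads); do
    if [ "$branch" = "master" ]; then
        continue
    fi
    count=$(git rev-list --count "master..$branch" -- A)
    if [ "$count" -gt 0 ]; then
        touching="$touching $branch"
    fi
done

echo "branches touching A:$touching"
```

Branches already merged into master report a count of zero, so they drop out of the result without extra filtering.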

Can Git subtree remember or track prefixes automatically?

We have git remote add origin http://... to avoid repeatedly typing the actual source repo path. But how about git subtree --prefix=...? It is hard to track and remember the prefix path, and error-prone to retype it every time I pull or push subtree content.
Is there any built-in feature to track prefix path automatically?
A contribution was made that addresses this by writing to a config file; it was not part of the original contrib that added subtrees.
Here's a blog about it: Blog about git subtree (with config)
And here's where it was contributed on github.
If you get that branch, I recommend merging in the latest from the main subtree contrib on GitHub.
In general, I think this is a good approach.
Subtrees are still evolving, and this is one of the missing links.
I'd like to also see the last commit id being recorded this way and deprecate the old way of using --rejoin to detect where to start the next split from.

Performance efficient Git history

Is there a way to efficiently read previous versions of files in Git? (I'm using Git as a database for a Content Management System, and need it to display history.)
Git doesn't store full versions of files, it only stores differences, so if you need a previous version, you can't just read it from disk; you have to ask Git to reconstruct it from the differences.
It seems that GitHub somehow does that; for example, you can see a previous version of a file. Does it really calculate it for every HTTP request, or does it somehow optimize it?
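Erratum: conceptually, git ALWAYS stores full versions of files: each revision of a file is a complete blob object, not a diff against its predecessor. (Packfiles may delta-compress blobs on disk, but that is a storage detail git resolves transparently.) Thus viewing an old revision does not require replaying the file's commit history.
This is in marked contrast to some other revision systems which only store diffs (patches). CVS in particular was hideous for accessing deep history or non-trunk branches for this very reason (for a large repository with many users).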
For reference, to conveniently access a particular file at a particular version (sha/reference):
git show HEAD:full/path/to/file
Replace HEAD with a tag, branch or git sha (the long hex number).
The path is the full path relative to the base of the git repository, not the file system root. I only mention this because it has bitten me a few times: you can't cd into a directory and expect to not specify the full path (unless you prefix the path with ./, as in HEAD:./file, which makes it relative to the current directory).
Wikipedia (home of all that is true and good) backs me up:
Git stores each revision of a file as a unique blob object. The relationships between the blobs can be found through examining the tree and commit objects. Newly added objects are stored in their entirety using zlib compression.
In case Wikipedia isn't your bag, a careful reading of the git internals manual also verifies it.
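For a quick check of the git show syntax, here is a sketch in a temporary repository. The docs/page.txt path and its contents are invented for the example; it reads the previous version by its repository-relative path, then again from inside the subdirectory using the ./ form, and confirms that the stored object is a blob.

```shell
#!/bin/sh
set -e
# Throwaway repository; docs/page.txt is a hypothetical file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

mkdir docs
echo "version 1" > docs/page.txt
git add docs/page.txt
git commit -q -m "first version"

echo "version 2" > docs/page.txt
git commit -q -am "second version"

old=$(git show HEAD~1:docs/page.txt)           # path relative to the repository root
kind=$(git cat-file -t HEAD~1:docs/page.txt)   # object type of the stored revision

cd docs
rel=$(git show HEAD~1:./page.txt)              # ./ makes the path relative to here

echo "previous version: $old"
echo "via ./ path:      $rel"
echo "object type:      $kind"
```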

Resources