Performance efficient Git history - performance

Is there a way to efficiently read previous versions of files in Git? (I'm using Git as the database for a content management system, and I need it to display history.)
As I understand it, Git doesn't store full versions of files, only differences, so a previous version can't simply be read from disk; Git has to compute it from the differences.
GitHub seems to manage this somehow: for example, you can view a previous version of a file. Does it really compute that for every HTTP request, or does it optimize it somehow?

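Erratum: Git's object model ALWAYS records full versions of files, one blob per version, not a chain of differences. (Packfiles may delta-compress objects on disk, but that is a storage detail handled inside Git, which still looks up any version directly by its object ID.) Viewing an old revision is therefore about as efficient as viewing the newest one.
This is in marked contrast to some other revision control systems, which store only diffs (patches). CVS in particular was hideous for accessing deep history or non-trunk branches for this very reason (for a large repository with many users).
For reference, to conveniently access a particular file at a particular version (SHA or reference):
git show HEAD:full/path/to/file
Replace HEAD with a tag, a branch, or a Git SHA (the long hex number).
The path is the full path relative to the base of the Git repository, not the file system root. I only mention this because it has bitten me a few times: if you cd into a subdirectory, you still have to give the full path from the repository root (or prefix the path with ./, as in HEAD:./file, to make it relative to the current directory).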
Wikipedia (home of all that is true and good) backs me up:
Git stores each revision of a file as a unique blob object. The relationships between the blobs can be found through examining the tree and commit objects. Newly added objects are stored in their entirety using zlib compression.
In case Wikipedia isn't your bag, a careful reading of the git internals manual also verifies it.
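To make the "stored in their entirety using zlib compression" point concrete, here is a minimal Python sketch of how a loose blob object is named and stored: a short header plus the complete file contents, hashed and then compressed. It assumes the default SHA-1 object format and ignores packfiles; the function names are made up for illustration and are not part of any Git API.

```python
import hashlib
import zlib


def blob_object(content: bytes) -> bytes:
    """Return the uncompressed object that Git hashes: header + full contents."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return header + content


def blob_id(content: bytes) -> str:
    """Object ID of a blob (assumption: default SHA-1 object format)."""
    return hashlib.sha1(blob_object(content)).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Bytes stored for a loose object: the whole object, zlib-compressed."""
    return zlib.compress(blob_object(content))


def read_loose_object(stored: bytes) -> bytes:
    """Recover the full file contents; no chain of diffs has to be replayed."""
    raw = zlib.decompress(stored)
    header, _, body = raw.partition(b"\0")
    if not header.startswith(b"blob "):
        raise ValueError("not a blob object")
    return body


if __name__ == "__main__":
    version = b"test content\n"
    print(blob_id(version))
    print(read_loose_object(loose_object_bytes(version)) == version)
```

Reading any version back is a single decompress step, whether the version is the newest or the oldest.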

Related

Getting error setting up a git repo on a shared drive

I would like to set up a git repository for several personal projects. I guess I'll do one repo per project; the projects are for different programs and libraries, not different parts of one program or system.
I'm on Windows 10 Pro, with a network drive I call Y. I'll put these repos in a directory cleverly named git. I found instructions to do this from the DOS command line:
Y:
cd \git\
git init --bare myrepo.git
Then, from the directory that holds the code I want to put into the repository:
C:
cd \files\programming\workspaces\project1
git remote add origin y:\git\myrepo.git
When I execute the last line, I get the message:
fatal: not a git repository (or any of the parent directories): .git
I have tried it with slashes and backslashes; I have deleted the directory I created with the git command and done it over (in case I'd forgotten something), but still get the same result. What am I doing wrong?
I'll put this in as a separate answer, in terms of setting up a second (and --bare) repository. You've already done this on Y:. You mention that this is a shared / network-accessible drive; that's sometimes OK, depending on how things work with lock files across shared drives. There are OS-specific, file-system-specific, and network-specific details to be considered. Let's just assume that they're OK, and/or that you and co-workers will be careful about who writes to the Y: drive. Anyway, having done:
cd /d Y:\git\
git init --bare myrepo.git
you now have a second repository (made first, but the order of creation doesn't really matter). Meanwhile you now do:
cd /d C:\files\programming\workspaces\project1
git init
which creates .git within C:\files\programming\workspaces\project1. You can now add a remote, which is just a short name under which your Git stores the URL of another Git repository:
git remote add origin file://Y:/git/myrepo.git
Now that there is a .git (repository) in C:\files\programming\workspaces\project1, this git remote add should work. The name origin is simply conventional: you can use any name you like, but origin is the one that git clone creates when you use git clone to make the local (C: area) repository.
Git itself does not mind if the URL has backslashes, but some other software might, so it's probably wisest to stick with the normal slashes here.
You're now ready to have your local repository—the one in C:\files\programming\workspaces\project1\.git—call up a second Git over at file://Y:/.... But there's no point connecting either of these two repositories until at least one of them has some commits, because commits are the main currency of exchange between two Git repositories. So, any time before or after the git remote add, you'll want to make at least one initial commit, which will cause the branch name master to spring into existence.
Once you have a commit, or some commits, you can run:
git push origin master
The git push command connects your Git—the repository you're in right now—to some other Git. The name origin here is the name of the remote, as set up by git clone normally, but by your git remote add in this particular case. The last argument, master, is what Git calls a refspec.
Refspecs are a bit complicated. They have two parts, separated by a colon : character. This particular refspec, master, is missing the colon. For git push, what that means is: use the same name on both sides. So this is really shorthand for:
git push origin master:master
The name on the left of the colon is usually a branch name in your repository. The name on the right side of the colon is the corresponding branch name in their repository—the other Git.
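The shorthand rule can be written down in a few lines. This is only a sketch of the simple case described here, in Python; the function name is hypothetical, and it deliberately ignores the other things a real refspec can contain (such as wildcards or an empty left-hand side).

```python
from typing import Tuple


def expand_push_refspec(refspec: str) -> Tuple[str, str]:
    """Split a simple push refspec into (our name, their name).

    With no colon, git push uses the same name on both sides.
    """
    source, colon, destination = refspec.partition(":")
    if not colon:
        destination = source
    return source, destination


if __name__ == "__main__":
    print(expand_push_refspec("master"))
    print(expand_push_refspec("master:master"))
    print(expand_push_refspec("feature:review"))
```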
Your Git will now invoke another Git, often over a network connection (ssh:// or git:// or http:// or https:// URLs). In this case, your Git works with the second Git repository directly, but you can just think of it as spinning off a second Git command that acts as the receiver for the other repository (because it actually does that).
Your Git now offers commits to that other Git, by their hash IDs. They inspect the hash ID and check in their database of commit-and-other-Git-objects to see if they have those hash IDs. Your Git wants to know: should I give you, other Git, this one? Or do you already have it? At this point, their database is empty, so the answer is always I don't have that one - gimme! Your Git now packages these up, compressing them for network transfer, and sends them over. They install these commits in their database.
At the end of this process, your Git sends their Git a polite request of the form: Please, if it's OK, set your master now to point to the same commit that my master points-to. If they accept—and here, they will—they now have their master pointing to the last commit in the chain of commits that ends with the commit that is in your repository, that your Git finds via your name master.
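The exchange can be modelled with two small in-memory stores. The Python sketch below is a toy, not Git's wire protocol: commit IDs are single letters instead of real hashes, each commit has at most one parent, and the receiver always accepts the branch update. All names in it are invented for illustration.

```python
from typing import Dict, Iterator, List, Optional


class ToyStore:
    """Toy repository: commits addressed by ID, plus a branch-name table."""

    def __init__(self) -> None:
        self.commits: Dict[str, Optional[str]] = {}  # commit ID -> parent ID
        self.branches: Dict[str, str] = {}           # branch name -> last commit ID

    def history(self, tip: Optional[str]) -> Iterator[str]:
        """Walk backwards from a commit through its parent links."""
        while tip is not None:
            yield tip
            tip = self.commits[tip]


def push(local: ToyStore, remote: ToyStore, source: str, destination: str) -> List[str]:
    """Offer commits by ID, send the missing ones, then ask remote to set its name."""
    tip = local.branches[source]
    missing = []
    for commit_id in local.history(tip):
        if commit_id in remote.commits:
            break  # assumption: the receiver also has everything older than this
        missing.append(commit_id)
    for commit_id in missing:
        remote.commits[commit_id] = local.commits[commit_id]
    remote.branches[destination] = tip  # the "polite request", accepted here
    return missing


if __name__ == "__main__":
    mine, theirs = ToyStore(), ToyStore()
    mine.commits = {"A": None, "B": "A", "C": "B"}
    mine.branches["master"] = "C"
    print(push(mine, theirs, "master", "master"))  # everything is new to them
    mine.commits["D"] = "C"
    mine.branches["master"] = "D"
    print(push(mine, theirs, "master", "master"))  # only the new commit is sent
```

The two sides agree on commit IDs, while the branch name on the receiving side is set only because the sender asked for it.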
If this sounds complicated, it's because it is: you and they don't share branch names at all, unless you and they want to. But you and they do share commit objects with their unique hash IDs.
Fetching (and cloning) is not quite symmetric, but I'm out of time now for this answer.
Apparently C:\files\programming\workspaces\project1 is not a Git repository.
It's true that Y:\git\myrepo.git is a Git repository (assuming the earlier git init worked). But C:\files\programming\workspaces\project1 is not. You'll need to create a second Git repository there.¹ You could git clone the empty repository over in Y:\git\myrepo.git, for instance (although cloning a totally empty repository has some weird side effects and is usually not the right way to start).
The way Git works in general is:
You clone an entire existing repository: every commit, which saves every version of every file, is now copied into the .git directory here.
Git is really all about these commits. Each commit has a full and complete snapshot of every file, frozen into a special, read-only, Git-only format, along with some additional information such as who made the commit, when, and why. These commits act as archives: every time someone ran git commit, Git archived everything.² You now have a copy of everything, in this special archival format that only Git can use.
Each commit has its own unique hash ID. No two commits can ever share an ID. For this reason, and the fact that every Git in the world has to give every commit a unique hash ID, these IDs have to be very big and ugly and random-looking. All Gits everywhere share hash IDs for all their commits, so that they can share their commits by ID, later. You can connect any two Gits to each other and they can obtain each other's commits just by using these hash IDs.
Because these hash IDs are so big and ugly, humans can't get them right. Fortunately we don't have to; we'll see this in a bit.
Then—this is actually built in, as the last step of the git clone you just did—you have Git select some commit. That selected commit becomes the current commit.
You usually select the commit by selecting a branch name, in which case that name becomes the current branch name as well, i.e., these two actions are paired up: selecting branch master as the current branch chooses the last commit in that branch as the current commit.
(There's an extra complication here when doing that initial git clone, again, but we'll skip it for now.)
Git now extracts all the files from the chosen commit, into a work area. In this work area, you have ordinary files that you can use with all the ordinary programs on your computer. These are your files, to work with as you wish: they're not Git's copies at all. Git just extracted everything from a commit, in order to make these copies available.
Once you have all the working copies of files stored in this work area, which we call a working tree or work-tree, you can work with them. That's why it's your work-tree: because you can actually get some work done.
Having worked on a bunch of files, you might want to save a new saved-for-all-time archival snapshot. You might think you could just run git commit and Git would save all your files. Other version control systems work this way, but Git does not. Git has, in a secret file,³ saved away all the files that came out of the current commit. Those files are in the special Git-only frozen format, but unlike the copies of the files that are in commits, they're not actually frozen. You can replace them, or remove some file(s) entirely, or put new ones in.
Git calls this special extra area the index, or the staging area, or sometimes—rarely these days—the cache. These three names for this one thing reflect its central and multiple roles in Git, or perhaps there are three names because the original name, "index", is just so terrible. :-) But either way, you need to know about the index.
Essentially, and leaving out some of its other roles, what the index is and does is represent the next commit you will make. That is, it contains, in a special Git-specific format, some information about your current work-tree, but more importantly, a copy of each file that will go into the next commit.⁴
Having updated your work-tree files, you need to copy those files back into Git's index, which you do with git add:
git add file1 folder/file2
for instance.⁵ This copies these two files from your work-tree into the index, turning the copies into the special Git-only format, ready to go into the next commit. In other words, these files are now staged for commit, hence the other name of the index, "staging area". They're not actually committed yet but they are ready to go. (They were ready to go before, too, but before, they matched the current commit's copy!)
At this point, running git commit makes a new commit from whatever files are in the index right now. This new commit gets your name and email address as both author and committer, and "now" (the current date-and-time reported by your OS) as the time-stamps. You should supply a log message giving the reason you made the commit: a summary of why you are doing whatever you are doing.⁶
The git commit command packages up all of this information—who, when, why, and so on—along with the raw hash ID of the current commit, and makes a new commit out of this plus the snapshot it makes using the files that are already in the right format in Git's index. Now that the new commit is made, it becomes the current commit. Now things are back to the way they were when you ran git checkout: the index and the current commit both contain the same set of files, in the frozen archive format, ready to go into a new commit.
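The relationship between work-tree, index, and commits can be sketched as a toy Python model. It is only an illustration under simplifying assumptions: file contents are plain strings, a commit's parent is a list position rather than a hash ID, and the class and method names are invented here.

```python
from typing import Dict, List, Optional


class ToyRepo:
    """Toy model: editable work-tree, index (proposed next commit), frozen commits."""

    def __init__(self) -> None:
        self.work_tree: Dict[str, str] = {}  # ordinary files you edit
        self.index: Dict[str, str] = {}      # what the next commit will contain
        self.commits: List[dict] = []        # snapshots, oldest first

    def add(self, *paths: str) -> None:
        """Copy the named files from the work-tree into the index."""
        for path in paths:
            self.index[path] = self.work_tree[path]

    def commit(self, message: str) -> dict:
        """Freeze the index (not the work-tree) into a new commit."""
        parent: Optional[int] = len(self.commits) - 1 if self.commits else None
        new_commit = {"parent": parent, "message": message, "files": dict(self.index)}
        self.commits.append(new_commit)
        return new_commit


if __name__ == "__main__":
    repo = ToyRepo()
    repo.work_tree = {"file1": "v1", "folder/file2": "v1"}
    repo.add("file1", "folder/file2")
    repo.commit("initial commit")

    repo.work_tree["file1"] = "v2"         # edited and staged below
    repo.work_tree["folder/file2"] = "v2"  # edited but never staged
    repo.add("file1")
    second = repo.commit("update file1")
    print(second["files"])
```

The second commit picks up the new file1 but still carries the old folder/file2, because only the index feeds the snapshot.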
Note that no existing commit changes during all of this. In fact, no existing Git commit can ever change. All commits are frozen for all time, read-only. They continue to exist as long as you and Git can find them—usually forever, but you can arrange to "lose" one, if you've made one that you don't like.
The way Git finds commits is important, and a little tricky. Once you get the hang of it, though, it's actually really simple.
¹ This may actually be the only repository you need: it's not clear why you wanted an empty and bare one in Y:\git\myrepo.git in the first place.
² More precisely, Git archived everything it was told to archive, as we'll see in a moment.
³ It isn't really secret at all, but you can't see it very well: it's hidden in a specially-formatted file in .git named index (and maybe other places too, but they all start from the index file; the index file contains records, and some of them might list more files).
⁴ Technically, what's in the index, in these cache entries, is the file's path-name, mode, and an internal Git blob hash ID. There's also a staging slot number which is really only used for merging. The hash ID means that rather than holding an actual copy of each file, the index just holds the record of the Git-formatted blob object. But unless you start using git ls-files --stage and git update-index directly, you don't really need to know about this: you can just think of the index as holding a copy of each file.
⁵ You can use either forward slash like this, or backslash; both work. I don't use Windows, and always use forward slash, and the few times I have been forced to use Windows briefly, I always name my files there with forward slashes. (This mostly works, except for a few commands that insist on thinking they're switch options. When dealing with Git and its ecosystem, backslash tends to confuse some other programs: \b, for instance, may represent a backspace, and \t a tab, so an attempt to name a file .\buttons\toggle can misfire and you end up with a file named .^Huttons^Toggle or something.)
⁶ Git can easily show what you did, later, but Git has no idea that this was, e.g., to fix bug#12345 or issue#97 or whatever it might be, much less how the bug or issue could be described. This log message is your opportunity to explain things like what the bug is, where to find it in the bug reporting system, what you discovered during investigation of the bug, and anything else that might be helpful to you, or someone else, looking at this commit later.
Branch names let Git find commits for you
A branch name like master, in Git, really just holds one hash ID.
That's all it needs to do. We mentioned before that whenever you have Git make a new commit, the new commit saves the raw hash ID of the current commit.
Suppose you have an existing Git repository with just one commit in it. This one commit has some big ugly hash ID, but we'll just call it A for short:
A
There's only the one commit in the repository. That one commit has however many files, but it's just one commit. It's easy to find: it's the commit. Let's add a second commit now, by having this commit checked out via the name master—we'll put the name in, in just a moment.
We modify some work-tree files, git add them, and run git commit and give it a reason for the commit to put in the log message. Git builds a new commit out of all of the files in the index, plus the usual metadata, including the hash ID of commit A. Let's call the new commit B, and draw it now:
A <-B
B contains the old commit's hash ID. We say that B points to A.
Git writes the new commit's hash ID into the name master, so let's draw the name master pointing to B now:
A--B <-- master
I've already gotten lazy here (on purpose): it's B that points to A, not vice versa. But the arrow coming out of B cannot change, because no part of any commit can change. It's the arrow coming out of master that changes. We call commit A the parent of B, and B a child of A.
The current branch is now master and the current commit is B. Let's make a new commit in the usual way:
A--B--C <-- master
New commit C points back to B, which points back to A. So B may be a child of A, but it's also the parent of C.
(Where does A point? The answer is: nowhere. Commit A is a little bit special. Being the very first commit, it can't point back to any earlier commit. So it just doesn't. Creating the first commit in a repository is a bit of a special act; it's what creates the branch name, too! The name master is not allowed to exist until some commit exists, so creating commit A creates everything.)
(I keep saying a child, not the child. That's because we can go back and add more children later. Commits, once made, are frozen for all time, so the children know exactly who their parents are, but parents can acquire new children, someday, in the future. When a new commit is made, it never has any children yet. So parents never know who their children are. That's why Git works backwards!)
Note how all we need is for Git to hold the raw (and random-looking) hash ID of the last commit in the name master. We can remember the name master, and Git remembers the hash ID for us. Adding a new commit consists of:
making sure the current branch name is master (git checkout master if needed)
so that the current commit is C
so that Git's index is full of the right copies of files, and our work-tree has the files we want
so that we can change work-tree files in place using all of the normal computer tools
so that we can git add the updated files to make Git copy them back into the index
so that we can git commit to make a new commit D
which will change our picture to read:
A--B--C--D <-- master
All of these new commits go into our repository. The repository itself is mainly just two big databases:
the commits, and other internal Git objects, addressed by hash IDs;
and a smaller name-to-hash-ID table, that says things like branch name master means commit a123456... or whatever.
The entire repository is in the .git directory / folder, underneath the top level of our work-tree. The branch name(s) find the last commits, and those commits find earlier commits. Git simply walks backwards, from last commit back to first one, one commit at a time. Git knows that it has run out of commits to walk backwards through when it reaches a root commit like commit A, that has no parent.
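The two databases and the backwards walk fit in a few lines of Python. This is a toy sketch only: single letters stand in for real hash IDs, every commit has at most one parent, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

Commits = Dict[str, Optional[str]]  # commit ID -> parent ID (None for a root commit)
Branches = Dict[str, str]           # branch name -> ID of the last commit


def make_commit(commits: Commits, branches: Branches, branch: str, new_id: str) -> None:
    """Record the current commit as parent, then move the branch name forward."""
    commits[new_id] = branches.get(branch)  # None when this is the very first commit
    branches[branch] = new_id


def log(commits: Commits, branches: Branches, branch: str) -> List[str]:
    """Start at the commit the name points to and walk back to the root."""
    walked = []
    current = branches.get(branch)
    while current is not None:
        walked.append(current)
        current = commits[current]
    return walked


if __name__ == "__main__":
    commits: Commits = {}
    branches: Branches = {}
    for commit_id in "ABCD":
        make_commit(commits, branches, "master", commit_id)
    print(log(commits, branches, "master"))
```

Only the name-to-ID table changes when a commit is added; the existing entries in the commit table are never rewritten.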
There is a lot more to it than this, starting with the fact that you can add more branch names:
A--B--C--D <-- master, dev
for instance, and move branch names around, and so on—and we haven't even touched on the idea of connecting this Git repository, in C:\files\programming\workspaces\project1, to another Git repository in Y:\git\myrepo.git or on another machine or whatever yet. That's where things get complicated. That's what git remote is for: a remote is a name you use in your Git to remember the URL for some other Git repository.
If you don't need to use remotes yet, don't do that; this is plenty to start with.

Can I remove all duplicates (not only consecutive) to put my histories (.bash_history, .gdb_history) under version control?

I have this mantra:
If it's plain text and it is valuable, put it under version control.
So far I have the following under git:
My editor (Emacs/Spacemacs) configuration.
My bash configuration
Various todo lists
I have created a repo to store my histories, but I have come across the issue of having duplicates in them.
To me it is pretty disappointing that HISTCONTROL=erasedups only deals with consecutive duplicates.
Would it be possible to create a hook that is executed every time I enter a new command to remove duplicates in the histories?
Or should it be a pre push git hook each time I push to the repo?
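Whichever hook ends up running it, the de-duplication step itself could look like the Python sketch below. It assumes a history file with exactly one command per line (no timestamp lines and no multi-line entries) and keeps the most recent occurrence of each command; the function names are made up for this example.

```python
from typing import Iterable, List


def dedupe_keep_last(lines: Iterable[str]) -> List[str]:
    """Remove every duplicate line, keeping only the latest occurrence of each."""
    seen = set()
    newest_first = []
    for line in reversed(list(lines)):
        if line not in seen:
            seen.add(line)
            newest_first.append(line)
    newest_first.reverse()
    return newest_first


def dedupe_history_file(path: str) -> None:
    """Rewrite a history file in place without duplicates."""
    # surrogateescape lets bytes that are not valid UTF-8 survive the round trip
    with open(path, encoding="utf-8", errors="surrogateescape") as handle:
        lines = [line.rstrip("\n") for line in handle]
    cleaned = dedupe_keep_last(lines)
    with open(path, "w", encoding="utf-8", errors="surrogateescape") as handle:
        handle.writelines(line + "\n" for line in cleaned)


if __name__ == "__main__":
    print(dedupe_keep_last(["ls", "cd /tmp", "ls", "git status", "cd /tmp"]))
```

Keeping the last occurrence preserves the order in which commands were most recently used, which also keeps the diffs in the history repository small.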

Does git checkout update all files?

Newb question, I want to make sure I understand this.
When I git checkout <revision>, does this return the entire project to its state at that moment, or does it only recreate the files changed in that particular revision?
For example: If my folder was completely empty besides the .git repo, and I git checkout master, will the resulting files be the project in its entirety, or only the files changed in the most recent commit?
I ask, because I am checking out my project at various points (starting from the beginning), and instead of the project slowly growing in size as one would expect, the size of each checkout is varying quite a lot.
When I git checkout <revision>, does this return the entire project to its state at that moment, or does it only recreate the files changed in that particular revision?
If your working tree and staging area are completely empty (besides the .git subdirectory, of course) and you run
git checkout <revision>
then your working tree and staging area will perfectly reflect the contents of that particular revision.
On the other hand, if your working tree is not empty when you run git checkout, what happens is much more subtle, and may be broken down into three cases:
The checkout is not problematic and Git carries it out without batting an eyelid: the contents of that particular revision get copied to your working tree (and overwrite stuff already present there, if needed). Or
The checkout, if it were carried out, would result in a loss of local changes; therefore, Git (under the assumption that you didn't use the -f flag) tells you off and aborts the checkout. Or
A more complicated situation may arise in which stuff is only partially checked out, and some local, uncommitted changes are kept in your working tree and/or index. More details about that situation can be found in my answer to Why are unstaged changes still present after checking out a different branch?.
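The three cases can be illustrated with a deliberately simplified Python model. It treats each revision as a mapping from path to contents, folds the index and the working tree into one mapping, and is an approximation written for this explanation, not a description of Git's actual implementation; all names are hypothetical.

```python
from typing import Dict

Tree = Dict[str, str]  # path -> file contents


class CheckoutAborted(Exception):
    """Raised when a checkout would overwrite local changes."""


def checkout(head: Tree, target: Tree, work_tree: Tree, force: bool = False) -> Tree:
    """Return the working tree after moving from revision head to revision target."""
    tracked = sorted(set(head) | set(target))
    differing = [p for p in tracked if head.get(p) != target.get(p)]
    if force:
        to_update = tracked  # -f: local changes to tracked files are discarded
    else:
        at_risk = [p for p in differing if work_tree.get(p) != head.get(p)]
        if at_risk:
            raise CheckoutAborted("local changes would be lost: " + ", ".join(at_risk))
        to_update = differing  # files identical in both revisions are left alone
    result = dict(work_tree)
    for path in to_update:
        if path in target:
            result[path] = target[path]
        else:
            result.pop(path, None)
    return result


if __name__ == "__main__":
    head = {"a": "1", "b": "1"}
    target = {"a": "2", "b": "1"}
    print(checkout(head, target, {"a": "1", "b": "1"}))      # case 1: clean
    print(checkout(head, target, {"a": "1", "b": "local"}))  # case 3: edit to b kept
    try:
        checkout(head, target, {"a": "local", "b": "1"})     # case 2: refused
    except CheckoutAborted as error:
        print(error)
```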
[...] the size of each checkout is varying quite a lot.
Are you taking into account untracked files? Did you commit, then later remove large files? On the basis of the information given in your question alone, we can do little more than hypothesize about the reason why the size varies a lot.
From the documentation: "Updates files in the working tree to match the version in the index or the specified tree." In the case of your example, it will return the whole project to its state as of the revision you check out.
However as Jubobs pointed out there is a difference in behaviour if you have made any changes to the state of your repository since your last checkout. His answer is more comprehensive than mine if this is the case.
Also note that this will only apply to files that are tracked by git, so any other files you have lying around will not be affected.

Can Git subtree remember or track prefixes automatically?

We have git remote add origin http://... to avoid retyping the actual source repo path every time. But what about git subtree --prefix=...? The prefix path is hard to remember, and retyping it every time I pull or push subtree content is error-prone.
Is there any built-in feature to track prefix path automatically?
Someone contributed an extension that records subtree settings in a config file; this was not part of the original contrib that added subtrees.
Here's a blog about it: Blog about git subtree (with config)
And here's where it was contributed on GitHub.
If you get that branch, I recommend merging in the latest from the main GitHub subtree contrib.
In general, I think this is a good approach.
Subtrees are still evolving, and this is one of the missing links.
I'd like to also see the last commit id being recorded this way and deprecate the old way of using --rejoin to detect where to start the next split from.

strategies for backing up packages on macosx

I am writing a program that synchronizes files across file systems much like rsync but I'm stuck when it comes to handling packages. These are folders that are identified by the system as containing a coherent set of files. Pages and Numbers can use packages rather than monolithic files, and applications are actually packages for example. My problem is that I want to keep the most recent version and also keep a backup copy. As far as I can see I have two options -
I can just treat the whole thing as a regular folder and handle the contents entry by entry.
I can look at all the modification dates of all the contents and keep the complete folder tree for the one that has the most recently modified contents.
I was going for (2) and then I found that the iPhoto library is actually stored as a package and that would mean I would copy the whole library (10s, or even 100s of gigabytes) even if only one photograph was altered.
My worry with (1) is that handling the content files individually might break things. I haven't really come up with a good solution that will guarantee that the package will work and won't involve unnecessarily huge backup files in some cases. If it is just iPhoto then I can probably put in a special case, or perhaps change strategy if the package is bigger than some user-specified limit.
Packages are surprisingly mysterious, and what the system treats as a package does not seem to be just a matter of setting an extended attribute on a folder.
It depends on how you treat the "backup" version. Do you keep two versions of each file (the current and first previous), or two versions of the sync snapshot (i.e. if a file hasn't changed between the last two syncs, you only store one version)?
If it's two versions of the sync, packages shouldn't be a big problem -- just provide a way to restore the "backup" version, which if necessary splices together the changed files from the "backup" with the unchanged files from the current sync. There are some things to watch out for, though: make sure you correctly handle files that're deleted or added between the two snapshots.
If you're storing two versions of each file, things are much more complicated -- you need some way to record which versions of the files within the package "go together". I think in this case I'd be tempted to only store backup versions of files within the package from the last time something within the package changed. So, for example, say you sync a package called preso.key. On the second sync, preso.key/index.apxl.gz and preso.key/splash.png are modified, so the old version of those two files get stored in the backup. On the third sync, preso.key/index.apxl.gz is modified again, so you store a new backup version of it and remove the backup version of preso.key/splash.png.
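One way to read that policy is sketched below in Python. It tracks only version labels per path rather than real file contents, records None for a file that did not exist before, and uses invented function names; treat it as an illustration of the bookkeeping, not a finished design.

```python
from typing import Dict, Optional, Tuple

Versions = Dict[str, str]          # path inside the package -> version label
Backup = Dict[str, Optional[str]]  # path -> previous version (None: did not exist)


def sync_package(current: Versions, backup: Backup, incoming: Versions) -> Tuple[Versions, Backup]:
    """Apply one sync; the backup is replaced only when something changed."""
    paths = set(current) | set(incoming)
    changed = {p for p in paths if current.get(p) != incoming.get(p)}
    if not changed:
        return dict(incoming), dict(backup)
    # Keep old versions of just the files changed by this sync, so the
    # backed-up files always "go together" with the rest of the package.
    new_backup: Backup = {p: current.get(p) for p in changed}
    return dict(incoming), new_backup


def restore_previous(current: Versions, backup: Backup) -> Versions:
    """Splice backed-up files over the current ones to rebuild the prior state."""
    restored = dict(current)
    for path, old_version in backup.items():
        if old_version is None:
            restored.pop(path, None)  # the file was added by the last change
        else:
            restored[path] = old_version
    return restored


if __name__ == "__main__":
    state = {"index.apxl.gz": "v1", "splash.png": "v1"}
    state, saved = sync_package(state, {}, {"index.apxl.gz": "v2", "splash.png": "v2"})
    state, saved = sync_package(state, saved, {"index.apxl.gz": "v3", "splash.png": "v2"})
    print(saved)
    print(restore_previous(state, saved))
```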
BTW, another way to save space would be hard-linking. If you want to store two "full" versions of a big package without wasting space, just store one copy of each unchanged file and hard-link it into both backups.
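A minimal Python sketch of the hard-link idea follows. It assumes both snapshots live on the same file system (hard links cannot span volumes), compares files byte for byte, and ignores symlinks and extended attributes; the function name and directory layout are hypothetical.

```python
import filecmp
import os
import shutil
from typing import Optional


def snapshot(source_dir: str, dest_dir: str, previous_dir: Optional[str] = None) -> None:
    """Copy source_dir to dest_dir, hard-linking files unchanged since previous_dir."""
    for root, _dirs, files in os.walk(source_dir):
        rel_root = os.path.relpath(root, source_dir)
        target_root = os.path.normpath(os.path.join(dest_dir, rel_root))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            prev = None
            if previous_dir is not None:
                prev = os.path.normpath(os.path.join(previous_dir, rel_root, name))
            if prev and os.path.isfile(prev) and filecmp.cmp(src, prev, shallow=False):
                os.link(prev, dst)      # unchanged: share the data with the old snapshot
            else:
                shutil.copy2(src, dst)  # new or modified: store a fresh copy
```

Because linked files share their data, a snapshot should be treated as read-only: editing a linked file in place would change it in every snapshot that links to it.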
