My question is simple:
I have a project with a local Git repo, and I have pushed it to Bitbucket. What I'm trying to do is to remove the local repo and commit my project just to the remote repo, so that I don't have a double-sized project on my hard drive.
Is there any good solution for this?
More details
I'm worried that the .git folder might drain my hard drive. Creating my local Git repo keeps a copy of all the files, and I ended up with a project that's twice as big. The app deals with media files... it's 500 MB without Git.
Creating local git keeps all files
Yes, that's what git does. From a cursory Google search:
Rather than a single, central repository on which clients synchronize, each peer's working copy of the codebase is a bona fide repository ... [this] results in some important differences from a centralized system
...
Each working copy effectively functions as a remote backup of the codebase and of its change-history, protecting against data loss
...
Allows users to work productively when not connected to a network.
Makes most operations much faster.
Allows private work, so users can use their changes even for early drafts they do not want to publish.
...
Avoids relying on one physical machine as a single point of failure.
As for your "problem"
I'm worried that the .git folder might drain my hard drive.
A git repository of Firefox is 200 MB. Consider the size of your project relative to Firefox, and then be prepared to set aside a generous ten thousand, two-hundred and forty kilobytes for your project's git repository.
I'd like to point out a few things:
Git can only make commits to a local repo.
Git compresses files in its repo.
Git is ill-suited for versioning binaries.
See each section below for full explanation.
Git can only make commits to a local repo
There is no way I know of to make a commit directly to a remote repository without going through a local one first; that's simply not how Git works. If you want to make commits, you need a local repository.
Git compresses files in its repo
Files under the .git directory are compressed, so the Git repo at the .git directory will probably be much smaller than your working copy checkout, especially if it's just full of text and not binary files (more on binary files later). At work, I use a Git repo that's about 300 MB, but the working copy is around 2.5 GB, so the actual repo itself is much smaller in comparison.
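If you want to check this on your own project, a quick sketch (assuming a Unix-like shell with du available, run from the repository root):

# size of the repository itself (compressed objects and packfiles)
du -sh .git
# size of the whole working copy, repository included
du -sh .
# Git's own accounting of loose and packed objects
git count-objects -vH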
Compression settings for Git
You can configure Git to use different compression levels:
core.compression
An integer -1..9, indicating a default compression level. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If set, this provides a default to other compression variables, such as core.loosecompression and pack.compression.
core.loosecompression
An integer -1..9, indicating the compression level for objects that are not in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to 1 (best speed).
pack.compression
An integer -1..9, indicating the compression level for objects in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to -1, the zlib default, which is "a default compromise between speed and compression (currently equivalent to level 6)."
Note that changing the compression level will not automatically recompress all existing objects. You can force recompression by passing the -F option to git-repack(1).
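For example, a minimal sketch of raising the compression level and recompressing an existing repository (level 9 here is just an illustration; it trades CPU time for smaller packs):

# use the strongest (and slowest) zlib level for packed and loose objects
git config pack.compression 9
git config core.compression 9
# rewrite the existing packs so the new level actually applies
git repack -a -d -F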
You can read more about Git packfiles from the free online Pro Git book.
Git is ill-suited for versioning binaries
Finally, the original poster makes this comment:
well...the app is dealing with media files... it's 500mb without git
Git is not well-suited to versioning binary files (media files such as pictures, videos, and audio clips), because Git can't keep compact text diffs of changes to them the way it can with text files; it essentially has to store each version of a binary in its entirety every time you change it.
So if you have a 1 MB picture file called logo.jpg, and you make a small change to it, Git would have to store the whole logo.jpg file all over again, adding another 1 MB to your repository.
Solution 1: Remove binaries with git filter-branch
If your media files don't actually need to be versioned in Git, consider removing them from your repository using git filter-branch. You can read more about this option in the official Linux Kernel Git documentation for git filter-branch and in the "The Nuclear Option: filter-branch" section of the free online Pro Git book.
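As a rough sketch only (the media/ path is a hypothetical example, and filter-branch rewrites history, so coordinate with collaborators and keep a backup first):

# remove a hypothetical media/ directory from every commit on every branch
git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch media/' \
  --prune-empty --tag-name-filter cat -- --all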
Solution 2: Use 3rd-party services for media files instead
GitHub makes this suggestion for dealing with large media files:
Binary media files don't get along very well with Git. For these files it's usually best to use a service specifically designed for what you're using.
For large media files like video and music you should host the files yourself or use a service like Vimeo or YouTube.
For design files like PSDs and 3D models, a service like Dropbox usually works quite nicely. This is what GitHub's designers use to stay in sync; Only final image assets are committed into our repositories.
More about Git and versioning binary files
You can learn more about Git and versioning binaries in these other Stack Overflow questions:
Managing large binary files with git.
git with large files.
Related
(This is not a duplicate of How does git detect that a file has been modified? because I'm asking about Windows; the referenced Q&A mentions stat and lstat, which do not apply to Windows.)
With traditional systems like SVN and TFS, the "state database" needs to be explicitly and manually informed of any changes to files in your local workspace: files are read-only by default so you don't accidentally make a change without explicitly informing your SVN/TFS client first. Fortunately IDE integration means that operations that result in the addition, modification, deletion and renaming (i.e. "checking-out") of files can be automatically passed on to the client. It also means that you would need something like TortoiseSVN to work with files in Windows Explorer, lest your changes be ignored - and that you should regularly run an often lengthy Server-to-Local comparison scan to detect any changes.
But Git doesn't have this problem: on my Windows machine I can have a gigabyte-sized repo with hundreds of thousands of files, many levels deep, and yet if I make a 1-byte change to a deeply nested file, git status knows about it. This is the strange part: Git does not use any daemon processes or background tasks, and running git status involves no significant IO activity that I can see. I get the results back immediately; it does not thrash my disk searching for the change I made.
Additionally, Git GUI tools, such as the Git integration in Visual Studio 2015, also have some degree of magic in them: I can make a change in Notepad or another program, and VS's Git Changes window picks it up immediately. VS could simply be using ReadDirectoryChanges (FileSystemWatcher), though when I look at the devenv process in Process Explorer I don't see any corresponding handles, and that still wouldn't explain how git status sees the changes.
Git runs a Windows equivalent of the POSIX lstat(2) call on each file recorded in the index as a first stab at figuring out whether the file is modified. It compares the modification time and size from that call with the values recorded for that file in the index.
This operation is notoriously slow on NTFS (and network-mapped drives), so some time ago Git for Windows gained a special tweak controlled by the core.fscache configuration option, which became enabled by default two or three Git for Windows releases ago. I don't know the exact details, but it tries to minimize the number of times Git needs to lstat(2) your files.
As I understand it, the mechanism enabled by core.fscache does not use the file-system-watching Win32 API, since Git runs no daemons or services on your system; it merely optimizes the way Git asks the filesystem layer for the stat info of the tracked files.
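If you want to check or toggle this on your own machine, something like the following should work (assuming a reasonably recent Git for Windows):

# show the current value (empty output usually means the built-in default applies)
git config core.fscache
# enable the filesystem cache explicitly for this repository
git config core.fscache true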
As Briana Swift and kostix point out - it is scanning your disk. However, when looking for unstaged changes, it does not need to read every file on your disk. Instead, it can look at the metadata stored in the index to determine what files to examine more closely (actually reading them).
If you use the git-ls-files command to examine the index, you can see this metadata:
% git ls-files --debug worktree.c
worktree.c
ctime: 1463782535:0
mtime: 1463782535:0
dev: 16777220 ino: 120901250
uid: 501 gid: 20
size: 5591 flags: 0
Now if you run git status, git will look at worktree.c on disk. If the timestamps and filesize match, then git will assume that you have not changed this file.
If, however, the timestamps and filesize do not match, then git will look more closely at the file to determine if you have changed it or not.
So git does "thrash" the disk, but in a much more limited manner than if you did something like tf reconcile to examine your changes. (TFVC, of course, was designed to deal with very large working trees and should never touch your disk if you're using it correctly.)
And yes - Visual Studio does have some magic in it. It runs a background filesystem watcher in both your working directory and some parts of the Git repository. When it notices a change in your working directory, it will re-compute the git status. It also looks at changes to branches in the Git repository to know when you've switched branches or to recompute the status of your local repository with your remote.
Git's process of git status is very lightweight.
git status compares your working directory and the index (also known as the staging area, where changes go when you run git add) against the last committed version (HEAD). Instead of reading every file in the repository, Git uses the metadata cached in the index to decide which files it needs to look at more closely.
git diff works similarly. I suggest looking here for more information.
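As a small illustration of those comparisons (plain Git commands; nothing here is specific to Windows):

git status          # summarizes working tree vs. index vs. HEAD
git diff            # unstaged changes: working tree vs. index
git diff --cached   # staged changes: index vs. HEAD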
If I am version-controlling my LaTeX docs and have a repo on Bitbucket which I share with other contributors, how do I share the png/jpg etc. files without having Git track them?
Every contributor should be able to compile the document without LaTeX's draft mode and see the complete paper with images, but it makes no sense to track such images with Git (my .gitignore has an img/ line in it).
Check out the "Downloads" section of your Bitbucket repo. It is made for "adding any file that you would like to make available to your users, such as app binaries", which sounds pretty much like what you need. But you collaborators still have to download / unpack them manually.
Also, you can actually store binaries in Git repos. The problem is that they cannot be deleted effectively due to Git internals, and each modification of a binary file duplicates all of its bytes, even if you changed only one. So if you don't change them frequently, it's pretty much OK. Bitbucket has a limit on maximum repository size, so you'll get a warning when it is full.
Another approach is to use Git Large File Storage (LFS), which was created specifically to handle binaries in Git repos. Unfortunately, it is not available on Bitbucket yet. If you can move your repo to GitHub, consider this possibility.
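For reference, the basic LFS workflow looks roughly like this (a sketch; *.psd is just an example pattern, design/mockup.psd is a hypothetical file, and this assumes the git-lfs client is installed):

git lfs install              # one-time setup of the LFS hooks
git lfs track "*.psd"        # store matching files as small LFS pointers
git add .gitattributes       # the tracking rules live here
git add design/mockup.psd    # hypothetical file; committed as a pointer, not the full binary
git commit -m "Track PSDs with LFS"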
I have made some changes. I cannot use those changes now. I need to discard them for now and go back to them later when the star alignment is more favorable (e.g. when our Cobol guy has enough time to get to his half of the work).
Short of using Eclipse → Synchronize with team and manually copy pasting the contents to a scratch directory so I can do the merging later, is there any way to "stash" changes for later?
There is no git stash equivalent in Serena Dimensions. The poor man's way is to store your changes temporarily in a different folder, or in a file with a different name that is not included in the source-controlled solution, and switch back and forth as needed.
Another alternative is to use streams in order to have your changes source-controlled without affecting production code; a typical scenario is to have an Integration stream and a Main stream. But it depends on your access level to the Dimensions database you are using and your project's needs.
A Git repo can be maintained locally to get this and other Git functionality on your local computer (or even for a small team with shared folders or a Git server), since it does not interfere with Dimensions, as long as you don't store the Git metadata in the Dimensions-managed code and vice versa. This is not a straightforward solution and requires that you know how to set up a Git repo and take precautions on your side when delivering to the Dimensions server, but it works and is really helpful if you are familiar with the Git workflow.
Dimensions is not as friendly as Git for this kind of usage, but it is much more robust for larger and more tightly controlled projects.
Git and Dimensions work on different methodologies. Dimensions only allows you to either commit a new version or discard the version after checking out the file. As indicated above, you can still use streams or individual branches for your development work and merge/deliver the changes at a later point in time without affecting others' work.
Issue: cloning a Mercurial repository over the network takes too much time (~12 minutes). We suspect it is because the .hg directory contains a lot of files (> 15,000).
We also have a Git repository which is even larger, but clone performance is quite good - around 1 minute. It looks like this is because the .git folder transferred over the network contains only a few files (usually < 30).
Question: does Mercurial support "compressing the repository into a single blob", and if so, how do I enable it?
Thanks
UPDATE
Mercurial version: 1.8.3
Access method: SAMBA share (\\server\path\to\repo)
Mercurial is installed on a Linux box, accessed from Windows machines (via Windows domain login)
Mercurial uses some kind of compression to send data over the network (see http://hgbook.red-bean.com/read/behind-the-scenes.html#id358828), but by using Samba you completely bypass this mechanism. Mercurial thinks the remote repository is on a local filesystem, and the mechanism used is different.
The linked documentation clearly says that the data is compressed as a whole before sending:
This combination of algorithm and compression of the entire stream (instead of a revision at a time) substantially reduces the number of bytes to be transferred, yielding better network performance over most kinds of network.
So you won't have the problem of 15,000 files if you use a "real" network protocol.
BTW, I strongly recommend against using something like Samba to share your repository. This is really asking for various kinds of problems:
lock problems when multiple people attempt to access the repository at the same time
file permission problems
file stat problems
problems with symlink management if used
You can find information about publishing repositories on the wiki: PublishingRepositories (where you can see that Samba is not recommended at all).
And to answer the question: AFAIK, there's no way to compress the Mercurial metadata or otherwise reduce the number of files. But if the repository is published correctly, this won't be a problem anymore.
You could compress it to a blob by creating a bundle:
hg bundle --all \\server\therepo.bundle
hg clone \\server\therepo.bundle
hg log -R therepo.bundle
You do need to re-create or update the bundle periodically, but creating the bundle is fast and could be done in a post-changeset hook on the server, or nightly. (Remaining changesets can be fetched by pulling from the repo after cloning from the bundle, if you set [paths] correctly in .hg/hgrc.)
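For example, a client-side .hg/hgrc might look something like this (a sketch reusing the share path from the question; adjust to wherever the real repository lives):

# .hg/hgrc in the clone made from the bundle
[paths]
default = \\server\path\to\repo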
So, to answer your question about several blobs, you could create a bundle every X changesets, and have the clients clone/unbundle each of those. (However, having a single one updated regularly + a normal pull for any remaining changesets seems easier...)
However, since you're running Linux on the server anyway, I suggest running hg-ssh or hgweb.cgi. That's what we do and it works well for us (with Windows clients).
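If you just want to test whether a real network protocol helps before setting up hgweb.cgi properly, Mercurial's built-in server is an easy sketch (the path, hostname, and port are placeholders):

# on the Linux server, serve the repository over HTTP (not meant for production use)
hg serve -R /path/to/repo --port 8000
# on a Windows client, clone over HTTP instead of the SMB share
hg clone http://server:8000/ repo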
The title may not be so clear but the issue I am facing is this:
Our designers are working on large Photoshop files across the network, which causes a number of network-traffic and file-corruption issues that I am trying to overcome.
The way I want to do this is to have the designers copy the files to their machine (Mac OS X) and work on them locally. But the problem then is that they may forget to copy them back up, or that another designer may start work on the version stored on the network.
What I need is a system where the designer checks out the files or folders from the server which locks those files so no other user can copy them until they are checked back in. We do not need to store revisions for the files.
My initial idea was to use SVN or preferably Git and somehow force a lock on checkout. Does this sound feasible, or is there a better system?
How big are the files on average? I'm not sure about Git as I haven't used it, but SVN should be OK. If you do go with SVN, I would trial checking out over HTTP/HTTPS vs. a network path to the repo, as you may get a speed advantage from one or the other. When we VPN to our repo at work it is literally 100 times faster over HTTP than checking out using a network \\path to the repo.
SVN is a good option, but you will have revisions (this is the whole point of SVN). SVN doesn't lock files by default, but you may configure it so that it does. See http://svnbook.red-bean.com/nightly/en/svn-book.html#svn.advanced.locking
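A minimal sketch of that lock-based workflow (the file path is hypothetical; svn:needs-lock makes the file read-only in working copies until someone takes the lock):

# mark the file so working copies receive it read-only until locked
svn propset svn:needs-lock '*' design/logo.psd
svn commit -m "Require a lock before editing logo.psd"

# a designer takes the lock before editing...
svn lock design/logo.psd -m "Editing the logo"
# ...and releases it when done (committing the file also releases the lock)
svn commit -m "Updated logo" design/logo.psd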
I don't know git very well, but since it's not a centralized VCS, I'm pretty sure it isn't the right tool for your situation.