Mercurial: is it possible to compress .hg folder to several large BLOBs? - performance

Issue: cloning a Mercurial repository over the network takes too much time (~12 minutes). We suspect this is because the .hg directory contains a lot of files (> 15,000).
We also have a git repository which is even larger, but its clone performance is quite good - around 1 minute. That seems to be because the .git folder transferred over the network contains only a few files (usually < 30).
Question: does Mercurial support compressing the repository into a single blob, and if so, how do we enable it?
Thanks
UPDATE
Mercurial version: 1.8.3
Access method: SAMBA share (\\server\path\to\repo)
Mercurial is installed on a Linux box and accessed from Windows machines (via Windows domain login)

Mercurial uses a form of compression to send data over the network (see http://hgbook.red-bean.com/read/behind-the-scenes.html#id358828), but by using Samba you bypass this mechanism entirely. Mercurial treats the remote repository as if it were on a local filesystem, and a different mechanism is used.
The linked documentation clearly states that the data is compressed as a whole before being sent:
This combination of algorithm and compression of the entire stream
(instead of a revision at a time) substantially reduces the number of
bytes to be transferred, yielding better network performance over most
kinds of network.
So you won't have the problem of 15,000 files if you use a "real" network protocol.
BTW, I strongly recommend against using something like Samba to share your repository. This is really asking for various kinds of problems:
lock problems when multiple people attempt to access the repository at the same time
file permission problems
file stat problems
problems with symlink management if used
You can find information about publishing repositories on the wiki: PublishingRepositories (where you can see that Samba is not recommended at all).
And to answer the question: AFAIK, there's no way to compress the Mercurial metadata or otherwise reduce the number of files. But if the repository is published correctly, this won't be a problem anymore.
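For example, since Mercurial is already installed on the Linux box, a minimal sketch of publishing over SSH (the user name, host name and path below are placeholders) would be to clone and pull directly from it instead of from the share:
hg clone ssh://youruser@server//path/to/repo
hg pull ssh://youruser@server//path/to/repo
With an ssh:// (or http://) URL, Mercurial uses its wire protocol, so changesets are streamed compressed instead of the client copying 15,000 files one by one.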

You could compress it to a blob by creating a bundle:
hg bundle --all \\server\therepo.bundle
hg clone \\server\therepo.bundle
hg log -R therepo.bundle
You do need to re-create or update the bundle periodically, but creating the bundle is fast and could be done in a server-side hook after each push, or nightly. (Any remaining changesets can then be fetched by pulling from the repository after cloning from the bundle, if you set [paths] correctly in .hg/hgrc.)
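As a sketch (the share, repository and bundle paths below are hypothetical, following the \\server\therepo example above), the clone made from the bundle can point its default path at the live repository so that a follow-up pull picks up newer changesets, and the server can regenerate the bundle whenever new changesets arrive:
# in the client clone's .hg/hgrc
[paths]
default = \\server\therepo
# in the server repository's .hg/hgrc: rebuild the bundle after each push
[hooks]
changegroup.bundle = hg bundle --all /srv/share/therepo.bundle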
So, to answer your question about several blobs: you could create a bundle every X changesets and have the clients clone/unbundle each of those. (However, having a single bundle updated regularly, plus a normal pull for any remaining changesets, seems easier.)
However, since you're running Linux on the server anyway, I suggest running hg-ssh or hgweb.cgi. That's what we do and it works well for us (with Windows clients).

Related

How do I stop OneDrive from downloading git.exe on Windows?

I have used Git on Windows for a while, but recently changed my setup and started getting this.
On almost every command in Git Bash (also in PowerShell and GitHub Desktop) I get
git.exe is being downloaded on OneDrive
(translation may not be exactly the same)
The setup change was moving my repos to a OneDrive folder in order to have them synced between two sessions: my work desktop and a remote virtual machine.
I can see that this may not be ideal, but it really works for me since I have the same settings in both sessions, and I'm not really used to doing many commit-push-pull cycles. Not the main topic here, but feel free to comment.
(Edit): After reading the solution, I see there are other ways to set up this syncing that don't mess with the internals of Git. Look into those instead. Thanks.
In any case, the strange thing is that the notifications happen only on the Remote Virtual Machine, but not on the desktop.
I have seen some notifications about files in the repos, which I attribute to OneDrive being nosy about every file I move. But I've also seen files I don't know about, and there's always git.exe attached to the notification.
For the first scenario I have tried turning down the notifications for OneDrive. Some might say Microsoft has a history of not letting users configure their notifications, so I'm still looking.
Thanks.
Most file syncing tools like OneDrive and Dropbox operate by syncing data file by file. This is a great approach if you're working on a single word-processing document or spreadsheet. However, it's not as great when you're working with a Git repository.
When changing between branches or making a commit, Git changes and creates a lot of files all at once. To be synced correctly, those files must be written in a particular order: all the blobs must be written, then the trees, then the commits, and only then can the refs be updated. If this happens out of order, your repository can be corrupted, since you can have branches that refer to objects that don't exist (or objects that refer to other objects that don't exist).
In addition, these tools can end up deleting files you wanted to have in your working tree or recreating files you didn't. So overall, you don't want to sync any Git repository using one of these tools.
You can write a bundle file with git bundle and sync that, or you can use rsync to sync a repository provided it's idle (not being modified) when you do. Note that if you sync a working tree, Git will need to refresh all files when you sync it across to the new machine, and also Git doesn't try to defend against untrusted users who have access to the working tree.
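A minimal sketch of both options (file names, directory paths and the remote host are placeholders):
# pack the whole repository, all refs, into a single file and sync just that
git bundle create repo.bundle --all
# on the other machine, clone (or later fetch) from the bundle file
git clone repo.bundle myrepo
# alternatively, copy the idle repository itself in one pass
rsync -a --delete /path/to/repo/ otherhost:/path/to/repo/
Either way the transfer is one large blob (or a single rsync pass) rather than OneDrive racing to sync thousands of small files in an arbitrary order.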
It's also not a good idea to sync your Git installation itself via OneDrive, which is what it sounds like might be happening. Instead, install Git for Windows on each machine independently and don't try to sync it across. OneDrive should have configuration options that let you control what's synced.

Hg repository corruption when using Windows network shared directory

I hope I can get some help here, as the SO UX is better than the Mercurial mailing list.
I've been happily using Mercurial at home for years. I am also using it with Bitbucket Cloud for a couple of more serious (but still hobby) projects.
Last year I switched my team at work from SVN (company-hosted) to Hg (self-hosted, with the central repo on a network location). We are all on Windows. Since then, we've been continually having problems with severe central-repository corruption, which can only be resolved by restoring from backup, e.g.:
% hg verify --verbose
repository uses revlog format 1
checking changesets
checking manifests
manifest#92: unknown parent 1 ef0f96d78ab6 of ef0f96d78ab6
manifest#92: reading delta ef0f96d78ab6: integrity check failed on 00manifest.i:88
manifest#93: unknown parent 1 e336adb3580b of e336adb3580b
manifest#93: reading delta e336adb3580b: integrity check failed on 00manifest.i:89
manifest#94: reading delta 7243aebd542b: unknown compression type '\x08'
manifest#95: reading delta 899e4507ca01: unpack requires a string argument of length 12
manifest#96: reading delta 12d4d930da4f: Manifest had an entry with a zero-length filename.
...
Some people say we shouldn't use a network share for the central repository, due to problems with locking. Others explain that Mercurial doesn't use those locks, and network shares should work fine, unless there are problems with the file system.
Considering the latter, I wonder if I could somehow debug our installation without asking the company to provide a server for hg. I don't know much about the configuration we are using, but here is what I see. The directory is accessible via a Windows network path: \\domain.com\path\path\our-directory. Inside, we created a directory called root where .hg resides. In .hgrc, the path is, accordingly:
[paths]
default = \\domain.com\path\path\our-directory\root
Our network directory is backed up (by the company). Hg version is 4.9.
I have had a similar experience with a similar setup.
The first thing to note is that, as I recall, older Hg versions definitely did have some problems when run over Windows network file shares, so make sure your version is current. (That was years ago, IIRC, so it is unlikely to be the root cause of your present issue.)
Secondly, in my case these problems seemed to be compounded by running Hg from within a virtual machine. Instead, I now run an hg serve instance on a PC that is not virtualized, and hit that with the various Hg clients. No more problems.
It appeared that, as long as the connection between the PC running hg serve and the file server was more reliable than the one from my client machine, the problem was avoided. Apparently the HTTP connection hg serve uses with the client is itself more reliable.
I can't say that is a definitive solution because I never found a root cause. But this seems to have avoided any more corruption for quite some time.
Note that hg serve is built right into the standard hg command-line tool, so you can run it from anywhere easily, and it doesn't have to run on the same server where the physical repository is stored. So in my case I use it quite casually; (obviously) you might need to coordinate with your IT people if you need something more robust.
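As a rough sketch (the mount point, repository path and port below are placeholders): the serving PC mounts the share and exposes the repository over HTTP, and clients pull and push through that URL instead of touching the files on the share directly.
# on the PC that has a reliable connection to the file server
hg serve -R /mnt/share/our-directory/root --port 8000
# on each client
hg clone http://serving-pc:8000/ our-project
hg pull http://serving-pc:8000/
Note that plain hg serve only accepts pushes if the served repository enables them (allow_push, and for plain HTTP push_ssl = false, under [web] in its hgrc), which is worth checking before a whole team relies on it.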

Make commits to a remote Git repo without a local one?

My question is simple:
I have a project with a local Git repo, and I have pushed it to Bitbucket. What I'm trying to do is to remove the local repo and commit my project just to the remote repo, so that I don't have a double-sized project on my hard drive.
Is there any good solution for this?
More details
I'm worried that the .git folder might drain my hard drive. Creating my local Git repo keeps all the files, and I ended up with a project that's twice as big. The app deals with media files... it's 500 MB without Git.
Creating my local Git repo keeps all the files
Yes, that's what git does. From a cursory Google search:
Rather than a single, central repository on which clients synchronize, each peer's working copy of the codebase is a bona fide repository ... [this] results in some important differences from a centralized system
...
Each working copy effectively functions as a remote backup of the codebase and of its change-history, protecting against data loss
...
Allows users to work productively when not connected to a network.
Makes most operations much faster.
Allows private work, so users can use their changes even for early drafts they do not want to publish.
...
Avoids relying on one physical machine as a single point of failure.
As for your "problem"
I'm worried that the .git folder might drain my hard drive.
A git repository of Firefox is 200 MB. Consider the size of your project relative to Firefox, and then be prepared to set aside a generous ten thousand, two-hundred and forty kilobytes for your project's git repository.
I'd like to point out a few things:
Git can only make commits to a local repo.
Git compresses files in its repo.
Git is ill-suited for versioning binaries.
See each section below for full explanation.
Git can only make commits to a local repo
There is no way I know of in Git to make a commit directly to a remote repository, without having to go through a local one first. That's not how Git works. If you want to make commits, I think you can only do so with a local repository.
Git compresses files in its repo
Files under the .git directory are compressed, so the Git repo at the .git directory will probably be much smaller than your working copy checkout, especially if it's just full of text and not binary files (more on binary files later). At work, I use a Git repo that's about 300 MB, but the working copy is around 2.5 GB, so the actual repo itself is much smaller in comparison.
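If you want to check this for your own project, a quick sketch (run from inside the working copy on a Unix-like shell; the numbers will of course be specific to your repository):
# pack loose objects, then report the size of the object database
git gc
git count-objects -v -H
# compare the .git directory against the whole checkout
du -sh .git .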
Compression settings for Git
You can configure Git to use different compression levels:
core.compression
An integer -1..9, indicating a default compression level. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If set, this provides a default to other compression variables, such as core.loosecompression and pack.compression.
core.loosecompression
An integer -1..9, indicating the compression level for objects that are not in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to 1 (best speed).
pack.compression
An integer -1..9, indicating the compression level for objects in a pack file. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. If not set, defaults to core.compression. If that is not set, defaults to -1, the zlib default, which is "a default compromise between speed and compression (currently equivalent to level 6)."
Note that changing the compression level will not automatically recompress all existing objects. You can force recompression by passing the -F option to git-repack(1).
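A hedged sketch of tightening compression and recompressing existing objects (level 9 is just an example; it trades CPU time for a smaller repository):
# per-repository setting
git config core.compression 9
# recompress existing objects with the new level
git repack -a -d -F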
You can read more about Git packfiles from the free online Pro Git book.
Git is ill-suited for versioning binaries
Finally, the original poster makes this comment:
well...the app is dealing with media files... it's 500mb without git
Git is not well-suited to versioning binary files (media files such as pictures, videos, and audio clips), because it can't keep compact text deltas of changes to them the way it can with text files; it effectively has to store each version of a binary in its entirety every time you change it.
So if you have a 1 MB picture file called logo.jpg, and you make a small change to it, Git would have to store the whole logo.jpg file all over again, adding another 1 MB to your repository.
Solution 1: Remove binaries with git filter-branch
If your media files don't actually need to be versioned in Git, consider removing them from your repository using git filter-branch. You can read more about this option in the official Linux kernel Git documentation for git filter-branch and in the "The Nuclear Option: filter-branch" section of the free online Pro Git book.
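A hedged sketch of the kind of invocation involved (the media/ path is hypothetical; rewrite history only on a fresh clone and coordinate with everyone else who has cloned the repo):
git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch media/' \
  --prune-empty --tag-name-filter cat -- --all
After the rewrite, the old objects still linger in the reflog and packfiles until they are expired and garbage-collected, so the size reduction only shows up after something like git reflog expire --expire=now --all followed by git gc --prune=now --aggressive.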
Solution 2: use 3rd party services for media files instead
GitHub makes this suggestion for dealing with large media files:
Binary media files don't get along very well with Git. For these files it's usually best to use a service specifically designed for what you're using.
For large media files like video and music you should host the files yourself or using a service like Vimeo or Youtube.
For design files like PSDs and 3D models, a service like Dropbox usually works quite nicely. This is what GitHub's designers use to stay in sync; only final image assets are committed into our repositories.
More about Git and versioning binary files
You can learn more about Git and versioning binaries in these other Stack Overflow questions:
Managing large binary files with git.
git with large files.

Check in - Check out process/version control for PSDs and Image files

The title may not be so clear but the issue I am facing is this:
Our designers are working on large Photoshop files across the network; this is causing a number of network traffic and file corruption issues which I am trying to overcome.
The way I want to do this is to have the designers copy the files to their machines (Mac OS X) and work on them locally. But the problem then is that they may forget to copy them back up, or that another designer may start work on the version stored on the network.
What I need is a system where a designer checks out the files or folders from the server, which locks those files so no other user can copy them until they are checked back in. We do not need to store revisions of the files.
My initial idea was to use SVN, or preferably Git, and somehow force a lock on checkout. Does this sound feasible, or is there a better system?
How big are the files on average? I'm not sure about Git, as I haven't used it, but SVN should be OK. If you do go with SVN, I would trial checking out over HTTP/HTTPS versus a network path to the repo, as you may get a speed advantage out of one or the other. When we VPN to our repo at work it is literally 100 times faster over HTTP than checking out using a network \\path to the repo.
SVN is a good option, but you will have revisions (that is the whole point of SVN). SVN doesn't lock files by default, but you can configure it so that it does. See http://svnbook.red-bean.com/nightly/en/svn-book.html#svn.advanced.locking
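As a sketch of the locking workflow (the file name is just an example): mark the binaries with the svn:needs-lock property so they are read-only until someone explicitly locks them, then lock and unlock around edits.
svn propset svn:needs-lock '*' logo.psd
svn commit -m "Require a lock before editing PSDs"
svn lock logo.psd -m "Editing the logo"
# ...edit and commit; committing releases the lock by default...
svn unlock logo.psd
This makes accidental concurrent edits much harder while still keeping the canonical copy on the server.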
I don't know git very well, but since it's not a centralized VCS, I'm pretty sure it isn't the right tool for your situation.

Concurrency in a GIT repo on a network shared folder

I want to have a bare git repository stored on a (Windows) network share. I use Linux and have that network share mounted with CIFS. My colleague uses Windows XP and has the network share automounted (via Active Directory, somehow) as a network drive.
I wonder if I can use the repo from both computers, without concurrency problems.
I've already tested, and on my end I can clone ok, but I'm afraid of what might happen if we both access the same repo (push/pull), at the same time.
In the git FAQ there is a reference to using network file systems (and some problems with SMBFS), but I am not sure if there is any file locking done by the network/server/Windows/Linux - I'm quite sure there isn't.
So, has anyone used a git repo on a network share, without a server, and without problems?
Thank you,
Alex
PS: I want to avoid using an http server (or the git-daemon), because I do not have access to the server with the shares. Also, I know we can just push/pull from one to another, but we are required to have the code/repo on the share for back-up reasons.
Update:
My worries are not about the possibility of a network failure. Even so, we would have the required branches locally, and we'd be able to compile our sources.
But, we usually commit quite often, and need to rebase/merge often. From my point of view, the best option would be to have a central repo on the share (so the backups are assured), and we would both clone from that one, and use it to rebase.
But, because we are doing this often, I am worried about file/repo corruption if it happens that we both push/pull at the same time. Normally we could yell at each other each time we access the remote repo :), but it would be better to have it enforced by the computers/network.
It is possible that Git has an internal mechanism for this (since someone can push to one of your repos while you work on it), but I haven't found anything conclusive yet.
Update 2:
The repo on the share drive would be a bare repo, not containing a working copy.
Git requires minimal file locking, and locking is, I believe, the main cause of problems when this kind of shared resource is used over a network file system. The reason Git can get away with this is that most of the files in a Git repo (all the ones that form the object database) are named as a digest of their content and are immutable once created. So the problem of two clients trying to use the same file for different content doesn't come up.
The other part of the repository is trickier: the refs are stored in files under the "refs" directory (or in "packed-refs"), and these do change, although each refs/* file is small and is always rewritten rather than edited in place. In this case, Git writes the new ref to a temporary ".lock" file and then renames it over the target file. If the filesystem respects O_EXCL semantics, that's safe. Even if not, the worst that could happen would be a race overwriting a ref file. Although this would be annoying to encounter, it should not cause corruption as such: it just might be the case that you push to the shared repo, and the push looks like it succeeded whereas in fact someone else's did. But this could be sorted out simply by pulling (merging in the other guy's commits) and pushing again.
In summary, I don't think that repo corruption is too much of a problem here. It's true that things can go a bit wrong due to locking problems, but the design of the Git repo will minimise the damage.
(Disclaimer: this all sounds good in theory, but I haven't done any concurrent hammering of a repo to test it out, and I only share repos over NFS, not CIFS.)
Why bother? Git is designed to be distributed. Just have a repository on each machine and use the publish and pull mechanism to propagate your changes between them.
For backup purposes, run a nightly task to copy your repository to the share.
Or, create one repository each on the share, do your work from those, and use them as distributed repositories from which you can pull each other's changesets. If you use this method, the performance of builds and so on will suffer, since you will be constantly accessing files over the network.
Or, have distributed repositories on your own computers, and run a periodic task to push your commits to the repositories on the share.
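A minimal sketch of that last variant (paths and the schedule are placeholders): create a bare repository on the share once, then mirror into it periodically.
# one-time setup of the backup target on the share
git init --bare /mnt/share/project-backup.git
# crontab entry: mirror all branches and tags to the share every night at 02:00
0 2 * * * cd /home/alex/project && git push --mirror /mnt/share/project-backup.git
The share is then only ever written by one well-defined job, which sidesteps the concurrency question entirely.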
It sounds as if you'd rather use a centralized version control system, so the backup requirement would be satisfied.
Perhaps with xxx2git in between for you to work locally.
