Does the number of files in a commit affect SVN performance?

I was about to commit about 1000 files at once after some refactoring work. Is it advisable to commit such a huge number of files, or should I commit them in batches? I am trying to weigh the pros and cons.
One of the pros is that I will have a single entry in SVN for all my changes, which will make the history easy to navigate.

With a number of files as small as 1000, I would worry less about performance and more about correct workflow. 1000 files is a lot of files, and thus a lot of changes, but Subversion should handle it reasonably well.
However, if all of the changes are not actually 1 change, then it should not be one commit. For example, if you're renaming 3 functions, I would make each rename a separate commit. Depending on what specifically you're doing, you may be able to get away with one commit, but a year from now when you're browsing through the logs, you'll make life easier on yourself if you tend to stick to small commits. If it really is only one change, then one commit is definitely your best option (for example, renaming one function).

SVN can handle 1000 files at once. The only reason to check in batches is to give each batch a different commit message, like "fixed bug #22" and "added flair".
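For illustration, batching would look something like this (the file paths and messages are made up):
```bash
# Commit each logical change separately, with its own message.
svn commit -m "Fixed bug #22" src/parser.c src/parser.h
svn commit -m "Added flair" src/ui/theme.css src/ui/banner.png
```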

The number of files doesn't really matter.
When you commit changes to your code repo, you should be thinking about build stability and test compliance.
That answers your question: if you have made changes to n files and only commit some of them, then you're likely to break the build (not even talking about the tests). So you should commit all necessary files to guarantee build integrity, at least.
SVN and other tools are well capable of dealing with such a number of files, which will represent a single transaction on the server.
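As a minimal sketch, one way to make sure nothing is left behind (the commit message is illustrative):
```bash
# Review every modified, added, or deleted file before committing...
svn status
# ...then commit them all together as one atomic transaction.
svn commit -m "Rename FooBar to BarBaz across the codebase"
```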

Related

How to manually specify a git commit sha?

This answer explains that normally a git commit SHA is generated based on various parameters. However, I would like to know: how can one specify a custom/particular/specific git commit sha (in Bash)?
For example, suppose one wants to create and push a commit to Git with the following sha:
1e23456ffd118db9dc04caf40a442040e5ec99f9
(For simplicity, assume it is a unique SHA.)
The XY-problem is a manual mirror script between two different Git servers. It would be more convenient to simply have identical commit SHAs than to keep a mapping of the commits between the Git servers. This is because the manual mirror is more efficient (saving computation time and server bandwidth) if I can skip certain commits from the source server. Yet that means the parent commits on the target server change with respect to the same commits on the source server. In turn, that implies the SHA changes, which would require me to keep track of a mapping of the SHAs in the source and target servers. In short, it would be more convenient to simply override the SHAs of the commits on the target server than to ensure the two servers have the exact same commits (for the few commits that are actually mirrored).
A commit SHA isn't just "normally" generated based on those parameters; it is by definition a hash of those parameters. "SHA" is the name of the hashing algorithm used to generate it.
Rather than trying to change the commit hashes, you should look for an efficient way to track them. One approach would be similar to how plugins like git svn work:
When copying a commit to the mirror, record the original commit hash as part of the new commit's commit message.
Possibly, since you're "skipping" commits in the original repo, each new commit may need multiple source hashes, since it acts like a "squash" of those commits.
Have a script which processes the result of git log and extracts these recorded commit hashes. This can then be used instead of the real commit hashes when determining what new commits to copy from the source.
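A rough sketch of that recording scheme, assuming a `Source-Commit:` trailer (the trailer name is made up for illustration, and the `%(trailers:...)` format needs a reasonably recent Git):
```bash
# When mirroring, embed the original hash in the new commit's message
# as a trailer (the second -m becomes a separate, final paragraph):
git commit -m "Mirrored change" -m "Source-Commit: 1e23456ffd118db9dc04caf40a442040e5ec99f9"

# Later, recover the mapping by scanning the log for that trailer:
git log --format='%H %(trailers:key=Source-Commit,valueonly)'
```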
However, make sure this is all worth it: if the eventual changes are all included, the chances are that git's existing de-duplication and compression will mean the overhead of the "skipped" commits is fairly low.
Since you've already outlined in your question that you have ways of handling your differences, I will assume this question is really and only this:
I would like to know: how can one specify a custom/particular/specific git commit sha (in Bash)?
And not "or do you have any other ideas that I could use instead".
And with that question, the answer is actually quite simple:
You can't.
Git doesn't calculate the commit id the way it does merely as a by-product of the implementation chosen. The way it is done is a core concept of how git is designed.
The commit id is calculated based upon the content of the commit, and this includes, as you have observed, the link to the parent. Change the parent but keep everything else identical, the commit id still changes.
This is core to how the distributed part of the version control system works, and cannot be changed.
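You can verify this yourself: a commit's id is literally the hash of the raw commit object. A quick sketch, runnable in any Git repository:
```bash
# Show the raw commit object: tree, parent(s), author, committer, message.
git cat-file commit HEAD
# Re-hash that object; the output matches `git rev-parse HEAD` exactly.
git cat-file commit HEAD | git hash-object -t commit --stdin
```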
You simply cannot change the id of a commit and keep the contents of it the same. This is by design.
There have been some attempts at producing collisions by carefully constructing distinct documents that end up having the same id.
Here's such a successful attempt (collision): https://www.theregister.com/2017/02/23/google_first_sha1_collision/
"'First ever' SHA-1 hash collision calculated. All it took were five clever brains... and 6,610 years of processor time"
I don't believe anyone has yet managed to take an arbitrary commit and then target a specific commit id with it. The collisions were carefully constructed by manipulating two documents simultaneously according to very specific criteria such that they arrived at the same id, but that id was not chosen by the researchers.
TL;DR: It can't be done
The net effect of the generated collision(s), though, is that Git will move away from SHA-1 at some point and go for a system that produces longer, "more secure" (tm) hashes than what we have today. Since Git also wants to be backwards compatible with existing repositories, this work is not yet complete.
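For what it's worth, recent Git versions can already create SHA-256 repositories, though the feature is still experimental and not interoperable with SHA-1 remotes at the time of writing:
```bash
# Create a repository whose object ids are SHA-256 instead of SHA-1.
git init --object-format=sha256 newrepo
```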
From the comment by CodeCaster, it seems I could use the freely choosable bits in the commit message in `git commit -m "some message"` to ensure the SHA of the commit ends up with a specific value.
However, based on the comment by Lasse V. Karlsen, I would assume this approach requires non-linear computational resources. I did not go into detail on this; however, I imagine that as the commit history grows, the relative impact of the (limited, 5 MB) freely choosable bits of the commit message becomes smaller. I guess that could explain why leveraging these freely choosable bits in the commit message becomes costly.
So in practice, the answer seems to be: "You could (perhaps, if you spend a lot of computational resources), but you shouldn't.".
how can one specify a custom/particular/specific git commit sha (in Bash)?
One cannot. The commit hash is a value constructed, as you say, by hashing various values together, and the whole point is to uniquely identify a particular commit. You could commit the same set of files at a different time on a different machine and you'd end up with a different commit hash.
The way to ensure that you have the same commits on two different machines is to git pull (or similar) those commits from one machine to the other.
You don't necessarily have to move all the commits -- you could e.g. squash them or cherry-pick only certain commits.
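A brief sketch of those two options (branch and commit names are illustrative):
```bash
# Copy a single commit from elsewhere onto the current branch.
git cherry-pick 1e23456

# Collapse all of a branch's changes into one staged change, then commit.
git merge --squash feature-branch
git commit -m "Squash of feature-branch"
```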

What are the advantages of a rebase over a merge in git?

In this article, the author explains rebasing with this diagram:
Rebase: If you have not yet published your branch, or have clearly communicated that others should not base their work on it, you have an alternative. You can rebase your branch, where instead of merging, your commit is replaced by another commit with a different parent, and your branch is moved there.
while a normal merge would have looked like this:
So, if you rebase, you are just losing a history state (which would be garbage collected sometime in the future). So, why would someone want to do a rebase at all? What am I missing here?
There are a variety of situations in which you might want to rebase.
You develop a few parts of a feature on separate branches, then realize they're in reality a linear progression of ideas. Rebase them into that configuration.
You fork a topic from the wrong place. Maybe it's too early (you need something from later), maybe it's too late (it actually applies to previous versions as well). Move it to the right place, as shown in the sketch after this list. The "too late" case actually can't be fixed by a merge, so rebase is critical.
You want to test the interaction of a branch with another branch, but for some reason don't want to merge. For example, you might want to see what conflicts crop up commit-by-commit, instead of all at once.
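A minimal sketch of moving a mis-forked topic, with placeholder branch names:
```bash
# Replay the commits of `topic` that are not reachable from `wrong-base`
# onto `right-base` instead, re-parenting the whole topic branch.
git rebase --onto right-base wrong-base topic
```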
The general theme here is that excessive merging clutters up the history, and rebasing is a way to avoid it if you didn't get your branch/merge plan right at first. Too many merges can make it hard for a human to follow the history, and also can make it harder to use tools like git-bisect.
There are also all the many cases which prompt an interactive rebase (see the sketch after this list):
Multiple commits should've been one commit.
A commit (not the current one) should've been multiple commits.
A commit (not the current one) had a mistake in it or its message.
A commit (not the current one) should be removed.
Commits should be reordered (e.g. to flow more logically).
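A minimal interactive-rebase sketch covering those cases (the commit count is arbitrary):
```bash
# Open the last five commits in an editor. Mark lines with `squash` to
# combine commits, `edit` to split or amend one, `reword` to fix a
# message, `drop` to remove one, or reorder the lines to reorder commits.
git rebase -i HEAD~5
```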
While it's true that you "lose history" doing these things, the reality is that you want to only publish clean work. If something is still unpublished, it's okay to rebase it in order to transform it to the way you should have committed it. This means that the final version in the public repository will be logical and easy to follow, not preserving any of the hiccups a developer had along the way.
Rebasing allows you to pick up merges in the proper order. The theory behind merging means you shouldn't have to worry about that. The reality of resolving complicated conflicts gets easier if you rebase, then merge new changes in order.
You might want to read up on Bunny Hopping

Can Visual Studio (should it be able to) compute a diff between any two changesets associated with a work item?

Here is my use case:
I start on a project XYZ, for which I create a work item, and I make frequent check-ins, easily 10-20 in total. ALL of the code changes will be code-read and code-reviewed.
The changesets are not consecutive - other people check in in between my changes, although they are very unlikely to touch the exact same files.
So ... at the end of the project I am interested in a "total diff" - as if there were a single check-in by me to complete the entire project. In theory this is computable. From the list of changesets associated with the work item, you get the list of all files that were affected. Then the algorithm can aggregate the individual diffs over each file and combine them into one. It is possible that a pure total diff is uncomputable because someone else renamed files, or changed things very close to, or in, the same functions as me. In that case ... I suppose a total diff could include those changes by non-me as well, and warn me about that fact.
I would find this very useful, but I do not know how to do it in practice. Can Visual Studio 2008/2010 (and/or TFS server) do it? Are there other source control systems capable of doing this?
Thanks.
You can certainly compute the 'total diff' yourself - make a branch of the project from the revision just prior to your first commit, then merge all your changesets into it.
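Something along these lines with the TFS command-line client, where the paths and changeset numbers are placeholders and the exact switches may vary by TFS version:
```bash
# Branch from the revision just prior to your first changeset of interest.
tf branch $/Project/Main $/Project/ReviewBranch /version:C1000

# Replay each of your changesets into the review branch, one at a time.
tf merge $/Project/Main $/Project/ReviewBranch /version:C1001~C1001
tf merge $/Project/Main $/Project/ReviewBranch /version:C1005~C1005

# The diff of ReviewBranch against its branch point is the "total diff".
```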
I don't think this is really a computable thing in the general case - only contiguous changesets can be merged automatically like this. Saying it's "unlikely" for others to have touched the files you're working on in the interleaving commits doesn't cut it; you need guarantees to be able to automate this sort of thing.
You should be working on a branch of your own if you want to be able to do this easily.
The ability to generate diff information for display or for merge purposes is functionality provided by your version control system, as Mahesh Velaga commented on another answer. If you were able to compute the diff by cherry-picking non-contiguous changesets, then logically you would also be able to merge those changes in a single operation. But this is not supported by TFS. So I strongly suspect that the construction of the cherry-picked diff information is also not supported by TFS. Other version control systems (git, mercurial, darcs come to mind) might have more support for something like this; I don't know for sure.
From my reading of their answers on the TFS version control forums, I think that their recommendation for this would be to create a branch of your own for doing this work in the first place: then the changesets would be contiguous on that branch and creating the "total diff" would be trivial. Since it sounds like you are working on an independent feature anyway (otherwise a diff of only your changes would be meaningless), you should consider having an independent branch for it regardless of whether your version control system is TFS or something else.
The alternative is to construct what such a branch would have looked like after the fact, which is essentially what Jim T's answer proposes. You might prefer that approach if your team is very keen on everyone working in the same kitchen, as it were. But as you are already aware, things can get messy that way.
Create two workspaces. Do a Get Specific Version in each, specifying the date or the two changesets in question. Now compare the two folders using a folder-comparison tool. Araxis Merge is a good one.
It sounds like you need a tool that supports changesets (changing multiple files and committing them all at once) instead of committing each file alone.
Take a look at this comparison between SourceSafe and Mercurial (Mercurial is free, and you can find tools to integrate it with Visual Studio).

Refactoring and non-refactoring changes as separate check-ins?

Do you intermingle refactoring changes with feature development/bug-fixing changes, or do you keep them separate? It seems that large-scale refactorings or reformattings of code, such as those performed with a tool like ReSharper, should be kept separate from feature work or bug fixes, because otherwise it is difficult to diff between revisions and see the real changes to the code amongst the numerous refactoring changes. Is this a good idea?
When I remember, I like to check in after a refactoring in preparation for adding a feature. Generally it leaves the code in a better state but without a change in behaviour. If I decide to back out the feature, I can always keep the better structured code.
Keep it simple.
Every check in should be a distinct, single, incremental change to the codebase.
This makes it much easier to track changes and understand what happened to the code, especially when you discover that an obscure bug appeared somewhere on the 11th of last month. Trying to find a one-line change amidst a 300-file refactoring checkin really, really sucks.
Typically, I check in when I have done some unit of work, and the code is back to compiling/unit tests are green. That may include refactorings. I would say that the best practice would be to try to separate them out. I find that to be difficult to do with my workflow.
I agree with the earlier responses. When in doubt, split your changes up into multiple commits. If you don't want to clutter the change history with lots of little changes (and have your revisions appear as one atomic change), perform these changes in a side branch where you can split them up. It's a lot easier to read the diffs later (and be reassured that nothing was inadvertently broken) if each change is clear and understandable.
Don't change functionality at the same time you are fixing the formatting. If you change the meaning of a conditional so that a whole bunch of code can be outdented, change the logic in one change and perform the outdent in a subsequent change. And be explicit with your commit messages.
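As an illustration of that split (the messages and function name are hypothetical), the history would then read:
```bash
# First the change that actually alters the logic...
svn commit -m "Simplify conditional in process_order() (bug #57)"
# ...then the purely cosmetic re-indentation as its own revision.
svn commit -m "Outdent block after conditional change (formatting only)"
```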
If the source code control system allows it...
(This does not work in my current job, because our source code control system doesn't like a single user checking out a single file to more than one location.)
I have two working folders.
Both folders are checked out from the same branch.
I use one folder to implement the new feature development/bug-fixing changes.
In the other folder I do the refactoring.
After each refactoring I check in from the refactoring folder,
then update the feature development folder, which merges in my refactorings.
Hence each refactoring is in its own check-in, and other developers get the refactorings quickly, so there are fewer merge problems.

Does TFS lose its link when you move a branch?

My co-worker is trying to merge his development branch back into the baseline. Even though he only modified a couple files, all files in the baseline are being checked out for merging. As if it's a baseless merge. What gives?
I don't experience this, and the only difference I can see is that I branched directly from the baseline, while he made a branch and then did a "move" on the branch. Does moving a branch mess up the link back to the baseline? He is still able to select the baseline in the GUI, so I don't think it's doing a baseless merge, since that's only available via the command line, but it's behaving like one.
Anyone got some insight or know what else we should check?
This is by design. TFS needs to mark the changeset where you moved the source branch as "already accounted for" so it's no longer a candidate next time you merge.
Merge history is recorded at check-in time by updating all of the pending changes that have their Merge bit set. Ordinarily, this is accompanied by other change types like Edit, Delete, etc. If not, it's just a record-keeping transaction, like the case you've encountered (there are other cases). No files will be modified by checking in the "no-op" merges.
