Split large repo into multiple subrepos and preserve history (Mercurial)


We have a large base of code that contains several shared projects, solution files, etc in one directory in SVN. We're migrating to Mercurial. I would like to take this opportunity to reorganize our code into several repositories to make cloning for branching have less overhead. I've already successfully converted our repo from SVN to Mercurial while preserving history. My question: how do I break all the different projects into separate repositories while preserving their history?
Here is an example of what our single repository (OurPlatform) currently looks like:
/OurPlatform
---- Core
---- Core.Tests
---- Database
---- Database.Tests
---- CMS
---- CMS.Tests
---- Product1.Domain
---- Product1.Stresstester
---- Product1.Web
---- Product1.Web.Tests
---- Product2.Domain
---- Product2.Stresstester
---- Product2.Web
---- Product2.Web.Tests
==== Product1.sln
==== Product2.sln
All of those are folders containing VS projects, except for the solution files. Product1.sln and Product2.sln both reference all of the other projects. Ideally, I'd like to turn each of those folders into a separate Hg repo, and also add new repos for each project (they would act as parent repos). Then, if someone was going to work on Product1, they would clone the Product1 repo, which would contain Product1.sln and subrepo references to ReferenceAssemblies, Core, Core.Tests, Database, Database.Tests, CMS, and CMS.Tests.
So, it's easy to do this by just hg init'ing in the project directories. But can it be done while preserving history? Or is there a better way to arrange this?
EDIT::::
Thanks to Ry4an's answer, I was able to accomplish my goal. I wanted to share how I did it here for others.
Since we had a lot of separate projects, I wrote a small bash script to automate creating the filemaps and to create the final bat script that actually does the conversion. What wasn't completely apparent from the answer is that the convert command needs to be run once for each filemap, to produce a separate repository for each project. This script is placed in the directory above an svn working copy that you have previously converted. I used the working copy since its file structure best matched what I wanted the final new hg repos to be.
#!/bin/bash
# this requires you to be in: /path/to/svn/working/copy/, and issue: ../filemaplister.sh ./
for filename in *
do
    extension=${filename##*.} #$filename|awk -F . '{print $NF}'
    if [ "$extension" == "sln" -o "$extension" == "suo" -o "$extension" == "vsmdi" ]; then
        base=${filename%.*}
        echo "#$base.filemap" >> "$base.filemap"
        echo "include $filename" >> "$base.filemap"
        echo "C:\Applications\TortoiseHgPortable\hg.exe convert --filemap $base.filemap ../hg-datesort-converted ../hg-separated/$base > $base.convert.output.txt" >> "MASTERGO.convert.bat"
    else
        echo "#$filename.filemap" >> "$filename.filemap"
        echo "include $filename" >> "$filename.filemap"
        echo "rename $filename ." >> "$filename.filemap"
        echo "C:\Applications\TortoiseHgPortable\hg.exe convert --filemap $filename.filemap ../hg-datesort-converted ../hg-separated/$filename > $filename.convert.output.txt" >> "MASTERGO.convert.bat"
    fi
done;
mv *.filemap ../hg-conversion-filemaps/
mv *.convert.bat ../hg-conversion-filemaps/
This script looks at every file in an svn working copy, and depending on the type either creates a new filemap file or appends to an existing one. The if is really just to catch misc visual studio files, and place them into a separate repo. This is meant to be run on bash (cygwin in my case), but running the actual convert command is accomplished through the version of hg shipped with TortoiseHg due to forking/process issues on Windows (gah, I know...).
So you run the MASTERGO.convert.bat file, which looks at your converted hg repo, and creates separate repos using the supplied filemap. After it is complete, there is a folder called hg-separated that contains a folder/repo for each project, as well as a folder/repo for each solution. You then have to manually clone all the projects into a solution repo, and add the clones to the .hgsub file. After committing, an .hgsubstate file is created and you're set to go!
With the example given above, my .hgsub file looks like this for "Product1":
Product1.Domain = /absolute/path/to/Product1.Domain
Product1.Stresstester = /absolute/path/to/Product1.Stresstester
Product1.Web = /absolute/path/to/Product1.Web
Product1.Web.Tests = /absolute/path/to/Product1.Web.Tests
Once I transfer these repos to a central server, I'll be manually changing the paths to be urls.
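For anyone who would rather script that last step, here is a minimal sketch of the path-to-URL rewrite. The base URL https://hg.example.com/repos is a made-up placeholder, and the sketch works on a temporary copy of the .hgsub content rather than on a real repository.

```shell
#!/bin/bash
# Hypothetical server location; substitute the real one.
base_url="https://hg.example.com/repos"

# Sample .hgsub content written to a temp file for illustration.
hgsub=$(mktemp)
cat > "$hgsub" <<'EOF'
Product1.Domain = /absolute/path/to/Product1.Domain
Product1.Web = /absolute/path/to/Product1.Web
EOF

# Replace everything after " = " up to the last slash with the base URL,
# keeping only the final path component (the repo name).
rewritten=$(sed -E "s#= .*/([^/]+)\$#= $base_url/\1#" "$hgsub")
printf '%s\n' "$rewritten"
```

The result would still need to be written back to .hgsub and committed in the parent repo.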
Also, there is no analog to the initial OurPlatform svn repo, since everything is separated now.
Thanks again!

This can absolutely be done. You'll want to use the hg convert command. Here's the process I'd use:
1. Convert everything to a single hg repository using hg convert with a source type of svn and a dest type of hg (it sounds like you've already done this step).
2. Create a collection of filemap files for use with hg convert's --filemap option.
3. Run hg convert with source type hg and dest type hg, with the source being the Mercurial repo created in step one, and do it once for each of the filemaps you created in step two.
The filemap syntax is shown in the hg help convert output, but here's the gist:
The filemap is a file that allows filtering and remapping of files and
directories. Comment lines start with '#'. Each line can contain one of
the following directives:
include path/to/file
exclude path/to/file
rename from/file to/file
So in your example your filemaps would look like this:
# this is Core.filemap
include Core
rename Core .
Note that once a filemap contains an include, the exclusion of everything else is implied. Also note that the rename line ends in a dot, which moves everything up one level.
# this is Core.Tests
include Core.Tests
rename Core.Tests .
and so on.
Once you've created the broken-out repositories for each of the new repos, you can delete the has-everything initial repo created in step one and start setting up your subrepo configuration in .hgsub files.
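The filemap-per-project step above could be sketched roughly as follows. The project list, the source repo path ../hg-converted, and the output root ../hg-separated are placeholders I made up for illustration, and the script only prints the hg convert commands so they can be reviewed before anything is run.

```shell
#!/bin/bash
src_repo="../hg-converted"     # assumed: the single converted hg repo
out_root="../hg-separated"     # assumed: where the split repos should land
workdir=$(mktemp -d)           # filemaps are written here for the demo

for project in Core Core.Tests Database; do
    map="$workdir/$project.filemap"
    {
        echo "# this is $project.filemap"
        echo "include $project"
        echo "rename $project ."
    } > "$map"
    # Print instead of executing, so the commands can be checked first.
    echo "hg convert --filemap $map $src_repo $out_root/$project"
done
```

Running the printed commands requires the convert extension to be enabled in your Mercurial configuration.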

Related

Loop in bash for reading repositories(folders)

I have made this script which:
Clones all repositories from Bitbucket into the folder "temporaryprojects". To clone the repos, the script uses my "repolinks.csv", which is what the name says: links to repos saved as a text file :)
After every repo is cloned, the script searches for all .ttf files in every folder (repo) in "temporaryprojects" and saves the result (the paths to every .ttf file) as TTF-Projects-PATHS.csv.
The otfinfo part reads the paths to the .ttf files (TTF-Projects-PATHS.csv), gives me specified info about those .ttf files (family, subfamily, author, etc.) and saves it as TTF-Projects-INFO.csv.
#!/bin/bash
cd /Users/krzysztofpaszta/temporaryprojects
for repo in $(cat /users/krzysztofpaszta/repolinks.csv); do
    git clone "$repo"
    echo Repo cloned to /users/krzysztofpaszta/temporaryprojects
done
# array with the repo name (repo_links, for example) + repo loop
echo Fetching paths of all TTF files on the device from Boombit projects
find "/users/krzysztofpaszta/temporaryprojects/" -name "*.ttf" > /Users/krzysztofpaszta/TTF-Projects-PATHS.csv
echo File paths fetched
while read in; do
    otfinfo --info filename="${fullfile##*/}" >> /users/krzysztofpaszta/TTF-Projects-INFO.csv "$in"
done < /users/krzysztofpaszta/TTF-Projects-PATHS.csv
echo File data fetched
rm -vrf /Users/krzysztofpaszta/temporaryprojects/*
echo Repo deleted
Everything is working great, but now I am struggling with how to modify this script. As you can see, right now all the repos are downloaded and then all the info about them is saved in one file named TTF-Projects-INFO.csv. What I need is for the script to search each repo individually and save the results as $repo.csv (the name of that repo as a CSV file), and to keep doing this through the last repo in "repolinks.csv".
I modified the script like that:
#!/bin/bash
cd /Users/krzysztofpaszta/temporaryprojects
for repo in $(cat /users/krzysztofpaszta/repolinks.csv); do
    git clone "$repo"
    find "/users/krzysztofpaszta/temporaryprojects/$repo" -name "*.ttf" > /Users/krzysztofpaszta/$repo.csv
    echo Repo cloned to /users/krzysztofpaszta/temporaryprojects
    while read in; do
        otfinfo --info filename="${fullfile##*/}" >> /users/krzysztofpaszta/TTF-Projects-INFO.csv "$in"
    done < /users/krzysztofpaszta/TTF-Projects-PATHS.csv
But unfortunately it does exactly the same thing: it saves every repo to "temporaryprojects", then searches for all the .ttf file paths, then collects the info and saves it all in one file. I think this is just a pretty simple loop in bash, but I have no idea how to do it properly. Could someone give me a hint? I have been trying to modify the script, but no luck so far.
SAMPLES OF CSV'S:
TTF-Projects-Paths.csv:
/users/krzysztofpaszta/temporaryprojects/project1/Fonts/SwallowFallsMixAllCyryl.ttf
/users/krzysztofpaszta/temporaryprojects/project2/Fonts/KOMIKAZE.ttf
/users/krzysztofpaszta/temporaryprojects/project2/Graphics/Fonts/Arial Unicode.ttf
/users/krzysztofpaszta/temporaryprojects/project3/fonts/SwallowFallsMixAll.ttf
TTF-Projects-INFO.csv:
/users/krzysztofpaszta/temporaryprojects/project1/Fonts/LiberationSans.ttf:Family: Liberation Sans
/users/krzysztofpaszta/temporaryprojects/project1/Fonts/LiberationSans.ttf:Subfamily: Regular
/users/krzysztofpaszta/temporaryprojects/project1/Fonts/LiberationSans.ttf:Full name: Liberation Sans
/users/krzysztofpaszta/temporaryprojects/project1/Fonts/LiberationSans.ttf:PostScript name: LiberationSans
etc... project2, project3 with the same info about fonts.
repolinks.csv:
https://bitbucket.org/organisation/project1-settings
https://bitbucket.org/organisation/crave-man
https://bitbucket.org/organisation/adverts
https://bitbucket.org/organisation/pipeline
https://bitbucket.org/organisation/data
https://bitbucket.org/organisation/async
https://bitbucket.org/organisation/audio
etc..

If you want to use $repo.csv instead of the same file for all repos, change the output file in the appending redirection:
# old
>> /users/krzysztofpaszta/TTF-Projects-INFO.csv
# new
>> /users/krzysztofpaszta/"$repo".csv
The name of the directory is not the whole URL, just the last part, which you can extract in bash using parameter expansion:
dir=${repo##*/} # Remove everything up to the last slash.
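A quick way to convince yourself of what that expansion yields is to run it over a couple of URLs. The second URL, with a trailing .git, is a hypothetical variant that is not in the question; git clone drops that suffix from the directory name, so the sketch strips it as well.

```shell
#!/bin/bash
for repo in \
    "https://bitbucket.org/organisation/project1-settings" \
    "https://bitbucket.org/organisation/crave-man.git"; do
    dir=${repo##*/}    # remove everything up to the last slash
    dir=${dir%.git}    # remove a trailing .git, if there is one
    echo "$repo -> $dir"
done
```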
Proper indenting helps understanding of the flow:
#!/bin/bash
cd /Users/krzysztofpaszta/temporaryprojects
for repo in $(cat /users/krzysztofpaszta/repolinks.csv); do
    git clone "$repo"
    dir=${repo##*/}
    echo Repo cloned to /users/krzysztofpaszta/temporaryprojects
    find /users/krzysztofpaszta/temporaryprojects/"$dir" -name "*.ttf" > /users/krzysztofpaszta/TTF-list-"$dir".csv
    while read in ; do
        otfinfo --info filename="${fullfile##*/}" "$in" >> /users/krzysztofpaszta/"$dir".csv
    done < /users/krzysztofpaszta/TTF-list-"$dir".csv
done
I'm not sure what $fullfile is, so I left it as was.

GitHub Branches: Case-Sensitivity Issue?

I seem to be having an issue with a repository continually recreating branches locally because of some branches on remote. I'm on a Windows machine, so I suspect that it's a case sensitivity issue.
Here's an example couple commands:
$ git pull
From https://github.com/{my-repo}
* [new branch] Abc -> origin/Abc
* [new branch] Def -> origin/Def
Already up to date.
$ git pull -p
From https://github.com/{my-repo}
- [deleted] (none) -> origin/abc
- [deleted] (none) -> origin/def
* [new branch] Abc -> origin/Abc
* [new branch] Def -> origin/Def
Already up to date.
When doing a git pull, the branches in question are capitalized. When I do a git pull -p (for pruning), it first tries to delete lowercased versions of the branches, then create capitalized versions.
The remote branches are capitalized (origin/Abc and origin/Def).
I have tried to temporarily change my Git config such that ignorecase=false (it is currently ignorecase=true). But I noticed no change in behavior. I'm guessing there's something local on my end that's currently holding onto those lowercased branches. But git branch does not show any version of these branches locally.
Short of completely obliterating the repository (a fresh git clone in a separate folder does not pull these phantom branches when trying pulls/fetches), is there anything I can do?

Git is schizophrenic about this.[1] Parts of Git are case-sensitive, so that branch HELLO and branch hello are different branches. Other parts of Git are, on Windows and macOS anyway, case-insensitive, so that branch HELLO and branch hello are the same branch.
The result is confusion. The situation is best simply avoided entirely.
To correct the problem:
1. Set some additional, private and temporary, branch or tag name(s) that you won't find confusing, to remember any commit hash IDs you really care about, in your own local repository. Then run git pack-refs --all so that all your references are packed. This removes all the file names, putting all your references into the .git/packed-refs flat-file, where their names are case-sensitive. Your Git can now tell your Abc from your abc, if you have both.
   Now that your repository is de-confused, delete any bad branch names. Your temporary names hold the values you want to remember. You can delete both abc and Abc if one or both might be messed up. Your remember-abc has the correct hash in it.
2. Go to the Linux server machine that has the branches that differ only in case from yours. (It's always a Linux machine; this problem never occurs on Windows or macOS servers because they do the case-folding early enough that you never create the problem in the first place.) There, rename or delete the offending bad names.
   The Linux machine has no issues with case (branches whose names differ only in case are always different), so there is no weirdness here. It may take a few steps, and a few git branch commands to list all the names, but eventually you'll have nothing but clear and distinct names: there will be no branches named Abc and abc both.
   If there are no such problems on the Linux server, step 2 is "do nothing".
3. Use git fetch --prune on your local system. You now no longer have any bad names as remote-tracking names, because in step 2 you made sure that the server (the system your local Git calls origin) has no bad names, and your local Git has made your local origin/* names match their branch names.
4. Now re-create any branch names you want locally, and/or rename the temporary names you made in step 1. For instance, if you made remember-abc to remember abc, you can just run git branch -m remember-abc abc to move remember-abc to abc.
   If abc should have origin/abc set as its upstream, do that now:
   git branch --set-upstream-to=origin/abc abc
   (You can do this in step 1 when you create remember-abc, but I think it makes more sense here so I put it in step 4.)
There are various shortcuts you can use, instead of the 4 steps above. I listed all four this way for clarity of purpose: it should be obvious to you what each step is intended to accomplish and, if you read the rest of this, why you are doing that step.
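To see what the pack-refs part of step 1 does, here is a small sketch that runs in a throwaway repository. It assumes git is installed and uses the default files-based ref storage; the temp directory, the demo identity, and the branch name remember-abc are only for illustration.

```shell
#!/bin/bash
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# A placeholder identity so the commit works on a machine with no git config.
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"

git branch remember-abc          # temporary name that remembers the commit
ls .git/refs/heads               # the branch exists as a loose file here

git pack-refs --all              # move every ref into .git/packed-refs
cat .git/packed-refs             # names are now plain strings in one file
```

After the pack, the branch names no longer exist as individual files, so the file system's case-folding cannot conflate them.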
The reason the problem occurs is outlined in nowox's answer: Git sometimes stores the branch name in a file name, and sometimes stores it as a string in a data file. Since Windows (and macOS) tend to conflate file names that differ only in case, the file-name variant retains its original case but ignores attempts to create a second file with the other case-variant name, and then Git thinks that Abc and abc are the same. The data-in-a-file variant retains the case-distinction as well as the value-distinction and believes that Abc and abc are two different branches that identify two different commits.
When git rev-parse refs/heads/abc or git rev-parse refs/remotes/origin/abc gets its information from .git/packed-refs—a data file containing strings—it gets the "right" information. But when it gets its information from the file system, an attempt to open .git/refs/heads/abc or .git/refs/remotes/origin/abc actually opens .git/refs/heads/Abc (if that file exists right now) or the similarly-named remote-tracking variant (if that file exists), and Git gets the "wrong" information.
Setting core.ignorecase (to anything) does not help at all as this affects only the way that Git deals with case-folding in the work-tree. Files inside Git's internal databases are not affected in any way.
This whole problem would never come up if, e.g., Git used a real database to store its <reference-name, hash-ID> table. Using individual files works fine on Linux. It does not work fine on Windows and MacOS, not this way anyway. Using individual files could work there if Git didn't store them in files with readable names—for instance, instead of refs/heads/master, perhaps Git could use a file named refs/heads/6d6173746572, though that halves the available component-name length. (Exercise: how is 0x6d m, 0x61 a, and so on?)
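For the curious, the hex name in that example can be reproduced in the shell. This is only an illustration of the encoding idea, not something Git actually does.

```shell
#!/bin/bash
name="master"
# Dump each byte as two hex digits, then strip the spaces and newline.
hex=$(printf '%s' "$name" | od -An -tx1 | tr -d ' \n')
echo "$hex"    # prints 6d6173746572
```

Each character becomes two hex digits (m is 0x6d, a is 0x61, and so on), which is why the encoded name is twice as long as the original.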
[1] Technically, this is the wrong word. It's sure descriptive though. A better word might be schizoid, as used in the title of one episode of The Prisoner, but it too has the wrong meaning. The root word here is really schism, meaning split and somewhat self-opposed, and that's what we're driving at here.

In Git, branches are just pointers to a commit. The branches are stored as plain files in your .git directory.
For instance, you may have abc and def files in .git/refs/heads.
$ tree .git/refs/heads/
.git/refs/heads/
├── abc
├── def
└── master
The content of each of these files is just the hash of the commit the branch points to.
I am not sure, but I think the option ignorecase is only relevant to your working directory, not the .git folder. So to remove the weird capitalized branches, you may just need to remove/rename the files in .git/refs/heads.
In addition to this, the upstream link from a local branch to a remote branch is stored on the .git/config file. In this file you may have something like:
[branch "Abc"]
remote = origin
merge = refs/heads/abc
Notice in this example that the local branch is named Abc, but the remote branch it merges from is abc (lowercase).
To solve your issue I would try to:
Modify the .git/config file
Rename the corrupted branches in .git/refs/heads, for example rename abc to abc-old
Try your git pull again

The answers supplied by nowox and torek were very helpful, but did not contain the exact solution. The existing references to remote in .git/config and the files in .git/refs/heads did not contain any versions of abc or def.
Instead, the problem existed in .git/refs/remotes/origin.
My .git/refs/remotes/origin directory had references to the lowercased versions of these feature branch folders. Some feature branches were made under abc and def using the lowercased versions, but they no longer exist on remote. The creator of these feature branches recently switched to using Abc and Def on remote. I deleted .git/refs/remotes/origin/abc and .git/refs/remotes/origin/def then executed fresh git pull -p commands. New folders, Abc and Def, were created, and subsequent pulls or fetches correctly display Already up to date.
Thanks to nowox and torek for getting me on the right track!

I did the following to solve my problem:
I navigated to the .git/refs/remotes/origin folder.
I deleted the folder with the buggy branch name.
I did git pull in the terminal.

I ran into a similar problem today. I did the following to solve it:
rename the 2nd branch to another name
rename the 1st branch to 2nd_branch_old_name
git push origin 1st_branch_new_name

Merge lines in bash

I would like to write a script that restores a file while preserving the changes that may have been made after the backup file was created.
In more detail: at some point I create a backup of a file (file_orig). I then make some changes to the original file (file_my_changes). After that, the original file can be changed again (file_additional_changes), but after the restore I want to have the backup file plus the additional changes (file_orig + file_additional_changes). In general, I want to back out my changes only.
I am talking about the grub.cfg file, so the expected changes will be adding or removing parts of a line.
Is it possible this to be done with a bash script?
I have 2 ideas:
Add some comments above the lines I am going to change, and then, before the restore, if a line differs from the one in the backup file, read the comment, which will tell me exactly what to remove from the line;
If there is a way to display only the part of a line that differs between file_orig and file_additional_changes, then replace this line with the line from file_orig + the part that differs. But I am not sure if this can be done at all.
Example:
line1: This is line1
line2: This is another line1
Is it possible to display only "another"?
Of course any other ideas are welcome!
Thank you!

Unclear, but perhaps if you're using a bash script you could run a diff between the edited file and the latest one, and save that output someplace where you want to keep it? That would mean you have a copy of the changes.
Or just use git like everybody else.

One possibility would be to use the POSIX commands patch and diff.
Create the backup:
cp operational-file operational-file.001
Edit the operational file.
Create a patch from the differences:
diff -u operational-file.001 operational-file > operational-file.patch001
Copy the operational file again.
cp operational-file operational-file.002
Edit the operational file again.
Create a new patch
diff -u operational-file.002 operational-file > operational-file.patch002
If you need to recover but skip the changes from operational-file.patch001, then:
cp operational-file.001 operational-file
patch -i operational-file.patch002 operational-file
This would apply just the second set of changes to the original file, as long as there's no overlap.
Consider using a version control system to keep records of the file changes. Consider using date/time stamps instead of version numbers on the file names.
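The whole sequence can be tried end to end on a scratch file. The sketch below uses an eight-line dummy file and sed edits as stand-ins for the manual editing steps. The two edits are placed far enough apart that the second patch's context lines do not touch the first change.

```shell
#!/bin/bash
work=$(mktemp -d)
cd "$work" || exit 1
printf 'line one\nline two\nline three\nline four\nline five\nline six\nline seven\nline eight\n' > operational-file

cp operational-file operational-file.001                      # backup
sed 's/line one/line one MINE/' operational-file > tmp && mv tmp operational-file
# diff exits with status 1 when the files differ, hence the "|| true".
diff -u operational-file.001 operational-file > operational-file.patch001 || true

cp operational-file operational-file.002
sed 's/line eight/line eight LATER/' operational-file > tmp && mv tmp operational-file
diff -u operational-file.002 operational-file > operational-file.patch002 || true

# Restore the backup, then re-apply only the later change.
cp operational-file.001 operational-file
patch -i operational-file.patch002 operational-file
cat operational-file
```

The restored file ends up with the original first line and the later change on the last line. If both edits had touched the same or neighbouring lines, patch would probably report a failed or fuzzy hunk instead.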

SVN checkout files that have been committed within given time period

I'm writing a deployment bash script that will publish recent changes in source control to a different machine. I'm new to svn from the command line (have used it in development for years) and new to bash scripting.
I need a way to checkout only files that have been modified recently. Something like this:
svn checkout svn://server/repo/project/trunk -mtime -1d4h
The idea being this would only checkout files that have been committed within the last 28 hours.

You can't check out changes; you only get the state of some part of the repository at some revision, which includes all files that exist in that node (subdirectory) at that revision (whether added or modified in that revision or in any earlier one).
The date format in Subversion doesn't allow relative date-times, only absolute values.
Scripts that export all files changed in a revision or revision range to a location outside the repository exist and can be found on the Net ("Subversion Command Line Script to export changed files" is a good bash sample).
Consequences of the above notes:
You must supply a correct Subversion-style date or revision number as the starting point.
A relative date in free-form notation can easily be constructed with the date command (-d "28 hours ago") and stored in a variable, which can be used as a parameter for electrictoolbox's script.
Deploying the files from the export directory to the final destination is the last part (heavily environment-specific, so no suggestions here).
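As a rough sketch of the date part: the snippet below builds an absolute timestamp for "28 hours ago" and prints, without running it, an svn command that uses it as a revision date. The -d form is GNU date and the -v form is the BSD/macOS fallback. The URL is the one from the question. Listing changes with svn diff --summarize is my own suggestion here, not part of the script mentioned above.

```shell
#!/bin/bash
# Try GNU date first, then fall back to the BSD/macOS syntax.
if ! since=$(date -d '28 hours ago' '+%Y-%m-%dT%H:%M:%S' 2>/dev/null); then
    since=$(date -v-28H '+%Y-%m-%dT%H:%M:%S')
fi

# Subversion accepts a date in curly braces wherever a revision is expected.
cmd="svn diff --summarize -r {$since}:HEAD svn://server/repo/project/trunk"
echo "$cmd"
```

The timestamp is in local time, so check that it matches what the server expects before relying on it.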

I found something simpler to suit my needs, based on this:
How can I keep the original file [commit] timestamp on Subversion?
I check out or update a working copy of the project using --config-option config:miscellany:use-commit-times=yes, which sets the timestamps on the filesystem to each file's last commit time. I then use a standard find command. For example:
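#!/bin/bash
# Print every path read from stdin except those inside .svn directories.
function listfn {
    while IFS= read -r file; do
        if [[ ! $file =~ \.svn ]]; then
            echo "$file"
        fi
    done
}
svn checkout --config-option config:miscellany:use-commit-times=yes svn://server/repo/project/trunk project-fordeploy
# Note: the -1d4h unit suffix is BSD find syntax (e.g. macOS).
find project-fordeploy -mtime -1d4h | listfn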
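If your find does not understand the d/h unit suffixes, the same 28-hour window can be written in minutes (28 x 60 = 1680), and the .svn filtering can be done by find itself. Here is a sketch on a scratch directory, with touch standing in for the commit-time timestamps that the checkout would set:

```shell
#!/bin/bash
tree=$(mktemp -d)
mkdir -p "$tree/src" "$tree/.svn"
touch "$tree/src/new.txt" "$tree/.svn/entries"    # modified just now
touch -t 200001010000 "$tree/src/old.txt"         # modified in the year 2000

# Files changed within the last 1680 minutes, skipping anything under .svn.
recent=$(find "$tree" -type f -mmin -1680 ! -path '*/.svn/*')
echo "$recent"
```

Only src/new.txt is listed. The old file falls outside the window and the .svn entry is excluded by the path test.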

Set svn:ignore recursively for a directory structure

I have imported a huge hierarchy of Maven projects into IntelliJ IDEA, and now IDEA has created .iml files on all different levels. I would like to svn:ignore these files.
There are several similar questions, for example this one: svn (subversion) ignore specific file types in all sub-directories, configured on root?
However: the answer is always to apply svn:ignore with the --recursive flag. That's not what I am looking for, I just want to ignore the files I created, not set a property on the hundreds of directories underneath.
Basically, what I have is the output of
svn status | grep .iml
which looks like this:
? foo/bar/bar.iml
? foo/baz/phleem.iml
? flapp/flapp.iml
etc.
What I would like to do is for each entry dir/project.iml to add svn:ignore *.iml to dir. I am guessing I have to pipe the above grep to sed or awk and from there to svn, but I am at a total loss as to the exact sed or awk command on one hand and the exact svn command (with which I won't override existing svn:ignore values) on the other hand.
Update: what I am also not looking for are solutions based on find, because possibly there are projects in this hierarchy where .iml files are in fact committed and I wouldn't want to interfere with those projects, so I'm looking to add the property only for .iml files my IDE has created.

You can set up your client to globally ignore given file extensions. Just add
global-ignores = *.iml
into your Subversion config file.
Update: If you want to only ignore iml files in the directories involved, you can try
svn status | grep '^\?.*\.iml' | sed 's=^? *=./=;s=/[^/]*$==' | xargs svn propset svn:ignore '*.iml'
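To check what the sed part feeds to xargs before touching any properties, you can run the text-processing half on canned status output. The sample lines below are made up to mirror the question, plus one modified file that should be filtered out. I also wrote the grep pattern with an unescaped ? (which is literal in basic regular expressions) and added sort -u to drop duplicate directories. Both are my own variations, not part of the one-liner above.

```shell
#!/bin/bash
status_output='?       foo/bar/bar.iml
?       foo/baz/phleem.iml
?       flapp/flapp.iml
M       foo/bar/pom.xml'

# Keep unversioned .iml files, then reduce each path to its directory.
dirs=$(printf '%s\n' "$status_output" \
    | grep '^?.*\.iml$' \
    | sed 's=^? *=./=;s=/[^/]*$==' \
    | sort -u)
printf '%s\n' "$dirs"
```

Keep in mind that svn propset replaces any svn:ignore value a directory already has, so directories with existing ignore lists would need their current value read (svn propget) and merged first.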
