Project structure for scientific Python projects (pip)

I am looking for a better way to structure my research projects. I have the following setup:
There are projects a, b, c and a library lib. Each project tackles a different research question, and the library carries code that is used across projects. Thus all projects depend on lib. Things get more complicated, as project c depends on projects a and b as well. When I work on project c, I will also update a, b, or lib simultaneously. Each project is in a separate git repository.
So far I have dealt with this situation by including the dependencies described above via git submodule, with all the source files located in the root dir of the project. The advantage is that I keep track of which version of lib my projects depend on; one of my projects can even depend on an outdated version of lib. I run everything from the root directory without "installing" any of the packages to site-packages or the like. When a path is not set correctly, I override it via sys.path.insert.
However, the following points make me want to change layout:
I keep losing track of which version of lib I am editing.
I want to make use of automated testing tools (tox, Jenkins, etc.), which seem to be much easier to handle with a standard project setup.
sys.path.insert can lead to subtle problems which are hard to debug.
I usually want all my projects to work with the tip of lib anyway.
Therefore I am currently rearranging all projects (especially lib) to follow the standard Python directory structure (source stored in a subdirectory, root containing a setup.py file) so that I can work in a virtualenv. I can then list all my dependencies in requirements.txt. First I install lib in development mode via pip install -e ., then I run pip freeze > requirements.txt, which includes a line similar to this:
-e git+<path_to_remote>@<sha>#egg=lib
So again I have generated a dependency on a specific commit (sha), as with git submodule, ensuring that I can check out an old commit and the project should still run. I can now install everything in a virtualenv and have gotten rid of my path problems. Great.
I face some new trouble though. One problem is how to update the sha in requirements.txt. The easiest (but probably not most elegant) solution I see is to write a pre-commit hook that updates the sha before committing. Is there a better way?
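Concretely, the hook I have in mind would look roughly like this (just a sketch; it assumes the editable checkout of lib lives at ../lib and that the requirements line has the form shown above):

#!/bin/sh
# .git/hooks/pre-commit -- rough sketch only; the ../lib path and URL format are assumptions.
# requirements.txt is assumed to pin lib with a line like "-e git+<path_to_remote>@<sha>#egg=lib".
LIB_SHA=$(git -C ../lib rev-parse HEAD) || exit 1
# Swap whatever sha is currently pinned for the commit lib is checked out at right now.
sed -i -E "s|(git\+.+)@[0-9a-f]+(#egg=lib)|\1@${LIB_SHA}\2|" requirements.txt
git add requirements.txt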
And more generally - do you see a better solution given my setup?

As far as I can see, you have mostly solved your problem and there are only small bits left.
1) Don't use hashes to identify versions of your libraries. Even if you don't publish your libraries to the Cheese Shop, use normal library versioning (semver) and tag your git repositories accordingly. This way you will have human-readable and manageable versions in the git+https://github.com/... URLs of your dependencies.
2) Set up tox so that you can test against both the stable versions of your dependencies (the ones you tagged last) and their master versions, taken right from the latest repository revision.
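To illustrate the idea (the repository URL and tag name here are invented), the two test environments could then install lib like this:

# In lib's repository: cut a release by tagging it (version number is made up).
git tag v1.2.0 && git push origin v1.2.0
# "Stable" test environment: install lib at its latest tagged release...
pip install "git+https://github.com/you/lib.git@v1.2.0#egg=lib"
# ..."latest" test environment: install lib straight from the tip of master.
pip install "git+https://github.com/you/lib.git@master#egg=lib"

In tox, these two lines would typically live in two separate requirements files, each referenced from the deps list of its own test environment.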

Related

How to store go dependencies?

I am using GoDep to resolve a project's dependencies.
My problem is that the repositories for dependencies might be removed and then my project wouldn't build.
I am trying to find a solution to store the dependencies in Artifactory or something similar.
Please advise.
Regards.
Okay, so GoDep may be the standard way of doing this, but I usually found it a bit complicated. In my opinion, use a Makefile which sets a custom GOPATH and just include the dependencies with your code (remove their .git folders). This way the versions are frozen and no one needs to do a godep restore or anything similar.
You can write recipes like make deploy that build your code, run gofmt, clean the pkg files, and install the binary to your custom GOPATH's bin/, and then you just go and run the binary.
You can have another one, like make install, that installs any missing dependencies.
I've also managed to create a watch target in my Makefile that keeps looking for changes on a Linux-based system using inotify-tools and triggers a rebuild.
Internally, all the recipes use standard go commands, but you get rid of GoDep and of maintaining its JSON file. Upgrading a dependency can be a bit of a problem, as you have to manually copy the whole directory into your custom path and remove its .git/ folder.
Our company uses this method and it seems to work quite nicely for us.
Plus, this method basically gets you away from the $GOPATH/src/github.com/repoName/ kind of paths.
If I seem unclear, let me know and I'll add a gist on GitHub.
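In the meantime, here is a rough sketch of what such recipes boil down to in plain shell (directory names are invented; the point is only the custom GOPATH and the vendored, .git-less dependencies):

#!/bin/sh
# Sketch of a "make deploy"-style build with a per-project GOPATH.
# Dependencies are committed under ./deps/src with their .git folders removed,
# so their versions are frozen together with the project.
PROJECT_ROOT="$(pwd)"
export GOPATH="$PROJECT_ROOT/deps:$PROJECT_ROOT"    # look up packages in-tree only
go fmt ./...                                        # keep formatting consistent on every build
go build -o "$PROJECT_ROOT/bin/app" ./src/app       # assumes the main package lives in src/app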

How to deal with Git Submodules in Visual Studio solutions with different layout?

We develop with Visual Studio 2010 (in C#) and migrated a while ago from SVN to Git. Now we are trying to split up our repository (which is quite big, ~30,000 files) into many git repositories, one for each solution.
The solutions share some projects, mostly libraries we develop in-house and like to add to from all the solutions.
The new repositories have a flat layout. One subdirectory for each project (shared projects are submodules).
In the big old repo, the projects are in a tree structure.
The problem occurs with external references in the submodules. In the old tree layout, the path to a referenced project may be "..\..\..\libs\someproject", while in the new layout the correct path would be "..\someproject".
We already had some edit wars concerning this and are not keen on more.
Half-baked solutions I could think of:
use "Reference Paths" in the *.csproj.user files and exclude them from version control (has to be redone for each developer and after each repository cleanup)
use branches for each situation and try to teach everyone where "real" commits should go and where "environment-change" commits should go (submodules are already not the simplest concept...)
embed binaries instead of the submodules (but what about developing changes to the submodules? what about different log4net versions?)
Does anyone know of a sane solution?
Since you are asking for a sane solution, I can only advise you to look into setting up your own NuGet service (look at http://www.MyGet.org for inspiration)
http://nuget.codeplex.com/
If you go down the route of package management, consider OpenWrap. However, embedding the package management artefacts in source code is a bad idea. You can use such tools to update what is actually stored in the submodules, but don't rely on them at build time. Expect the binaries to be there from the point of view of your build scripts.
So if I understand you correctly, the problem is with Visual Studio and not with Git? If that's the case, use the old tree structure that worked with Visual Studio. Make your submodule structure a tree structure too: the top of the tree would be one super repo whose submodules (the branches) have submodules of their own, until you get down to the leaves of your tree. It would be a pain to set up at first, but it should just work.
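A minimal sketch of how that nesting is wired up (URLs are placeholders, and it assumes the libs repository already declares its own submodules):

git submodule add https://example.com/git/libs.git libs    # run inside the super repo
git commit -m "Add libs as a submodule"
# A developer then checks out the whole tree, nested submodules included, with:
git clone --recursive https://example.com/git/SuperRepo.git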
Use one submodule to house all "common libraries", just one level deep. Ideally, though, you should move the common libraries behind services with well-defined contracts; that way you can incrementally roll out new versions with no downtime, and each solution only needs a submodule that holds the contracts. These could be interfaces or messages.
I have a similar problem using VS 2013.
I want to use git-svn instead of SVN directly. SVN has a gigantic set of directories, and I could not create a single git repository that would contain our whole trunk folder: git-svn always exited with an error and the repository ended up corrupted. I worked around the problem as follows:
Using git-svn, I cloned the subset of folders from SVN/trunk that I needed, creating one git repository per folder.
I created a local parent git repository to contain all my git-svn-cloned folders.
Each git repository was added as a submodule to the parent git repository.
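Roughly, the commands were along these lines (URLs and folder names are placeholders; newer git versions may additionally require protocol.file.allow=always for local-path submodules, which is included below):

git svn clone https://svn.example.com/svn/trunk/folderA folderA    # one git repository per SVN folder
git svn clone https://svn.example.com/svn/trunk/folderB folderB
mkdir parent && cd parent && git init
git -c protocol.file.allow=always submodule add ../folderA folderA
git -c protocol.file.allow=always submodule add ../folderB folderB
git commit -m "Tie the git-svn clones together as submodules"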
The remaining problem with Visual Studio is that it does not recognize the projects outside the folder of the solution I opened; only the files in that folder are recognized by Visual Studio as being under Git source control.
I tried setting the Git preferences to use the upper-level parent directory as the location of the Git repository, without noticing any difference.

Perl and Ruby modules in the same repository?

I've started working on a new Perl module and I've decided that I want to make a Ruby version of it as well (once I finish the Perl version). Do people tend to make separate repositories for each language? Or put them in the same repository?
I can easily see how the two sets of code are different enough to be treated as separate projects. But at the same time it's the same functionality written in two languages, so from that perspective it seems like a single project with two language ports.
What's considered best practice in this situation?
FWIW, I'm using git.
EDIT: I should be more clear here. These aren't modules in the sense of git submodules. They're modules that will be submitted to CPAN and RubyGems. Users of this project will likely be installing it via cpan or gem and then using/requiring it in the normal fashion.
In the course of my group's research, we have a couple of repos, some with different technologies in each. We divide the repos by research question and check out only the projects we are working on, with all the repos sharing a uniform hierarchical directory structure. Since we already know the repo directory structure, running scripts and finding data becomes much easier.
I would recommend taking the same approach. The cleaner the separation between the two technologies, the easier it will be to contribute to one of them without being confused by the presence of the other.
In the end, ask yourself this: if I were to add another language, would I still keep it in one repo? If the answer is yes, keep doing what you're doing. If not, keep these libraries in two separate repos and manage the projects and contributors distinctly.
My experience in this kind of case is to have two smaller git repos, one for each of the modules; cloning one branch into the consumer project's repo keeps things quite simple. Another way is to create a bare clone of the module's repo inside the consumer project's repo and then just keep updating it as each module's development progresses. The consumer project should ignore the injected repos.
Once another dev clones module A and/or B, he/she can just push to the consumer project, as permissions allow. This is either a pro or a con, depending on your setup.

Jenkins + Cmake + JIRA = CI of multiple interdependent projects?

We have a number of small projects within our system running on Linux (Slackware 7-11, slowly migrating to RHEL 6.0): around 50-100 applications and 15-20 libraries. Almost all our applications use one or more of our libraries. Our source tree looks something like this:
/app1
/app2
/app3
/include
/foo/app4
/foo/app5
/foo/app6
/foo/lib1
/foo/lib2
/lib/lib3
/lib/lib4
/lib/include
Now, I've done some work creating some CMakeLists.txt files and have built most of the libs and some of the apps. I'm fairly comfortable with using cmake to build. I did this with v2.6, and I recently (an hour ago) upgraded to 2.8. Each of the above projects has its own CMakeLists.txt file specific to the project to handle building and installation (no packaging yet).
I have a requirement to make use of and enforce continuous integration. I've installed and played around with Jenkins, and from what I've seen I'm very impressed. I'm also evaluating JIRA to do our issue tracking.
Just to get things up and going, I've done a cmake install on all the libs, so the apps can find them in the filesystem. Headers are installed to /usr/local/include and libs to /usr/local/lib. Is this a bad thing to do? Would it be better to tell cmake to look for the lib's source directory, use the export interface or the recently introduced ExternalProject_Add?
Because I'm going to be using Jenkins, I cannot be guaranteed that cmake can find the source or build directory. Of course, I can tell Jenkins to build the projects in order (or at least, build the dependencies first). If an update to a library breaks the building of another project, then I guess it'll be up to someone with 3/4 of a wit to determine this.
Thank you in advance
Just to get things up and going, I've done a cmake install on all the libs, so the apps can find them in the filesystem. Headers are installed to /usr/local/include and libs to /usr/local/lib. Is this a bad thing to do?
No, it is not a bad thing to do, but your build should be able to reproduce its resources from scratch. Portability and fixing build bugs become an issue if things need to be pre-installed on the system outside of the build process. If you are able to do it one of the other ways you mentioned, I would suggest that, but if it's going to make your build that much longer, it's something you need to feel out. My ideology is that everything should be movable to a new Jenkins machine with a fresh install at the drop of a hat; that isn't always achievable, but it is something to strive for.
Because I'm going to be using Jenkins, I cannot be guaranteed that cmake can find the source or build directory. Of course, I can tell Jenkins to build the projects in order (or at least, build the dependencies first). If an update to a library breaks the building of another project, then I guess it'll be up to someone with 3/4 of a wit to determine this.
Well, one of the things I do with interdependent jobs is have the successful build of one job trigger the jobs that depend on it. So, for example, if A depends on B and B fails, A will never be run, and whoever caused the issue in build B is responsible for it, and so on. This prevents a cascading effect of broken builds that were all caused by one broken dependency. I would suggest that you keep the files produced by a particular build in its job folder and tell the dependent jobs the location of the required files. Again, keep your builds separate and clean.
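As a rough sketch of that idea with CMake (directory names follow the tree above; the deps prefix and job layout are assumptions), you can install the libraries into a prefix inside the job's workspace instead of /usr/local and point the dependent builds at it:

# $WORKSPACE is set by Jenkins; everything below stays inside the job's workspace.
DEPS_PREFIX="$WORKSPACE/deps"
mkdir -p build/lib1 && cd build/lib1
cmake ../../foo/lib1 -DCMAKE_INSTALL_PREFIX="$DEPS_PREFIX"     # out-of-source build of a library
make install                                                   # headers and libs land under $DEPS_PREFIX
cd ../..
mkdir -p build/app1 && cd build/app1
cmake ../../app1 -DCMAKE_PREFIX_PATH="$DEPS_PREFIX"            # dependent app finds the libs there
make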
I'm also evaluating JIRA to do our issue tracking.
I highly recommend JIRA as an issue-tracking system for a company; you might want to look at the JIRA plugin for Jenkins for integration. If you're using git and you don't mind hosting your code off-site, I would give GitHub Issues a shot as well.
Good luck, you seem to be on the right track.

What is the best practice for sharing a Visual Studio Project (assembly) among solutions

Suppose I have a project "MyFramework" that has some code, which is used across quite a few solutions. Each solution has its own source control management (SVN).
MyFramework is an internal product and doesn't have a formal release schedule, and same goes for the solutions.
I'd prefer not having to build and copy the DLLs to all 12 projects, i.e. new developers should be able to just do an svn checkout and get to work.
What is the best way to share MyFramework across all these solutions?
Since you mention SVN, you could use externals to "import" the framework project into the working copy of each solution that uses it. This would lead to a layout like this:
C:\Projects
MyFramework
MyFramework.csproj
<MyFramework files>
SolutionA
SolutionA.sln
ProjectA1
<ProjectA1 files>
MyFramework <-- this is a svn:externals definition to "import" MyFramework
MyFramework.csproj
<MyFramework files>
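For concreteness, the externals definition behind this layout would be set up roughly like so (the repository URL is a placeholder; the URL-first format needs SVN 1.5 or newer):

# Run inside the SolutionA working copy.
svn propset svn:externals "https://svn.example.com/repos/MyFramework/trunk MyFramework" .
svn commit -m "Import MyFramework via svn:externals" .
svn update    # pulls MyFramework into the working copy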
With this solution, you have the source code of MyFramework available in each solution that uses it. The advantage is that you can change the source code of MyFramework from within each of these solutions (without having to switch to a different project).
BUT: at the same time this is also a huge disadvantage, since it makes it very easy to break MyFramework for some solutions when modifying it for another.
For this reason, I have recently dropped that approach and am now treating our framework projects as a completely separate solution/product (with their own release-schedule). All other solutions then include a specific version of the binaries of the framework projects.
This ensures that a change made to the framework libraries does not break any solution that is reusing a library. For each solution, I can now decide when I want to update to a newer version of the framework libraries.
That sounds like a disaster... how do you cope with developers undoing/breaking the work of others...
If I were you, I'd put MyFramework in a completely separate solution. When a developer wants to develop one of the 12 projects, he opens that project's solution in one IDE and opens MyFramework in a separate IDE instance.
If you strong-name your MyFramework assembly, GAC it, and reference it in your other projects, then the "copying DLLs" issue goes away.
You just build MyFramework (a post-build event can run GacUtil to put it in the assembly cache) and then build your other project.
The "best way" will depend on your environment. I worked in a TFS-based, continuous integration environment, where the nightly build deployed the binaries to a share. All the dependent projects referred to the share. When this got slow, I built some tools to permit developers to have a local copy of the shared binaries, without changing the project files.
Does work in any of the 12 solutions regularly require changes to the "framework" code?
If so your framework is probably new and just being created, so I'd just include the framework project in all of the solutions. After all, if work dictates that you have to change the framework code, it should be easy to do so.
Since changes in the framework made from one solution will affect all the other solutions, breaks will happen, and you will have to deal with them.
Once you rarely have to change the framework as you work in the solutions (this should be your goal) then I'd include a reference to a framework dll instead, and update the dll in each solution only as needed.
svn:externals will take care of this nicely if you follow a few rules.
First, it's safer if you use relative URIs (starting with a ^ character) in your svn:externals definitions and put the projects in the same repository if possible. This way the definitions will remain valid even if the Subversion server is moved to a new URL.
Second, make sure you follow this hint from the SVN book and use peg revisions in your svn:externals definitions to avoid random breakage and unstable tags:
You should seriously consider using explicit revision numbers in all of your externals definitions. Doing so means that you get to decide when to pull down a different snapshot of external information, and exactly which snapshot to pull. Besides avoiding the surprise of getting changes to third-party repositories that you might not have any control over, using explicit revision numbers also means that as you backdate your working copy to a previous revision, your externals definitions will also revert to the way they looked in that previous revision...
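Put together, a pinned, repository-relative externals definition could look roughly like this (the path and revision number are invented; peg revisions in externals need SVN 1.6 or newer):

svn propset svn:externals "^/framework/trunk@1234 MyFramework" .
svn commit -m "Pin the MyFramework external to r1234" .
svn update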
I agree with another poster that this sounds like trouble. But if you can't or don't want to do it the "right way", I can think of two other ways to do it. We used something similar to number 1 below (for a native C++ app).
1) A script, batch file, or other process that is run to do a get and a build of the dependency (just once). It is built/executed only if there are no changes in the repo. You will need to know which tag/branch/version to get. You can use a .bat file as a pre-build step in your project files.
2) Keep the binaries in the repo (not a good idea). Even in this case, the dependent projects have to do a get and need to know which version to get.
Eventually what we tried to do for our project(s) was mimic how we use and refer to 3rd party libraries.
What you can do is create a release package for the dependency that sets up a path env variable to itself. I would allow multiple versions of it to exist on the machine and then the dependent projects link/reference specific versions.
Something like
$(PROJ_A_ROOT) = c:\mystuff\libraryA
$(PROJ_A_VER_X) = %PROJ_A_ROOT%\VER_X
and then reference the version you want in the dependent solutions either by specific name, or using the version env var.
Not pretty, but it works.
A scalable solution is to do svn-external on the solution directory so that your imported projects appear parallel to your other projects. Reasons for this are given below.
Using a separate sub-directory for "imported" projects, e.g. externals, via svn-external seems like a good idea until you have non-trivial dependencies between projects. For example, suppose project A depends on project B, and project B on project C. If you then have a solution S with project A, you'll end up with the following directory structure:
# BAD SOLUTION #
S
+---S.sln
+---A
| \---A.csproj
\---externals
+---B <--- A's dependency
| \---B.csproj
\---externals
\---C <--- B's dependency
\---C.csproj
Using this technique, you may even end up having multiple copies of a single project in your tree. This is clearly not what you want.
Furthermore, if your projects use NuGet dependencies, these normally get restored into a top-level packages directory. This means that the NuGet references of projects within the externals sub-directory will be broken.
Also, if you use Git in addition to SVN, a recommended way of tracking changes is to have a separate Git repository for each project, and then a separate Git repository for the solution that pulls in those projects via git submodule. If a Git submodule is not an immediate sub-directory of the parent module, then the git submodule command will make a clone that is an immediate sub-directory.
Another benefit of having all projects on the same level is that you can then create a "super-solution" which contains the projects from all of your solutions (tracked via Git or svn:externals), which in turn allows you to check with a single solution rebuild that any change you made to a single project is consistent with all other projects.
# GOOD SOLUTION #
S
+---S.sln
+---A
| \---A.csproj
+---B <--- A's dependency
| \---B.csproj
\---C <--- B's dependency
\---C.csproj
