Point of Artifact Repositories For Internal Code - continuous-integration

Imagine that, for a cloud-based solution, a good portion of the deployed code is developed internally. My question is: what is the point of using an artifact repository for internal code when you could always build whatever version you need directly from the source code?
In other words, doesn't it make more sense to spend the time on the build server, making it easy to build any desired artifact version from the code, rather than adding an artifact repository like Nexus to feed build artifacts to deployments?

In theory yes, if you can be certain that:
everything that went into an artifact is checked in (sources, data files, and so on)
the exact environment (OS, compiler, linker, tools) used to build your artifact can be restored perfectly (for example, from a snapshot of a virtual machine)
nothing was forgotten
EDIT
In practice, as Mark O'Conner notes, even then two builds will normally not be identical, because they typically include timestamps and checksums that depend on those timestamps. You would have to fix those values manually during the build, or somehow reproduce the exact time and timing on your build computer.
Otherwise you might face a situation where you cannot (exactly) rebuild a certain artifact. I prefer to have everything that was published stored in a safe place.
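To illustrate the timestamp point, here is a minimal sketch of what "fixing" a timestamp during packaging can mean. It is not taken from any real build tool: the file names, the fixed date, and the helper names are assumptions for the example. It packs the same inputs into a ZIP twice and compares hashes, pinning each entry's timestamp instead of using the current time.

```python
import hashlib
import io
import zipfile

# Assumed fixed timestamp for every entry (the earliest date ZIP can store).
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def build_archive(files: dict[str, bytes]) -> bytes:
    """Pack files into a ZIP whose bytes depend only on names and contents."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        # Sort names so that insertion order cannot change the output.
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE_TIME)
            archive.writestr(info, files[name])
    return buffer.getvalue()


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical inputs; the second build receives them in a different order.
sources = {"app/main.txt": b"hello", "app/data.txt": b"world"}
first = digest(build_archive(sources))
second = digest(build_archive(dict(reversed(list(sources.items())))))
print(first == second)
```

Real compilers and packagers embed time in many more places than this, which is why storing the published artifact is usually simpler than trying to rebuild it bit for bit.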

The Continuous Delivery book calls the practice of building a binary more than once an antipattern:
This antipattern violates two important principles. The first is to
keep the deployment pipeline efficient, so the team gets feedback as
soon as possible. Recompiling violates this principle because it
takes time, especially in large systems. The second principle is to
always build upon foundations known to be sound. The binaries that get
deployed into production should be exactly the same as those that went
through the acceptance test process—and indeed in many pipeline
implementations, this is checked by storing hashes of the binaries at
the time they are created and verifying that the binary is identical
at every subsequent stage in the process.
Binary equality checking via hash may also be important for auditing purposes in highly regulated domains.
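The hash check described in the quote can be sketched as follows. This is only an illustration under assumptions: the function names, the artifact file name, and the choice of SHA-256 are mine, not something the book prescribes.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_artifact(path: Path, recorded_digest: str) -> None:
    """Raise if the artifact no longer matches the digest recorded at build time."""
    actual = sha256_of(path)
    if actual != recorded_digest:
        raise ValueError(f"{path.name}: expected {recorded_digest}, got {actual}")


with tempfile.TemporaryDirectory() as workdir:
    artifact = Path(workdir) / "app.bin"  # hypothetical artifact name
    artifact.write_bytes(b"binary built once")
    recorded = sha256_of(artifact)        # recorded when the binary is created
    verify_artifact(artifact, recorded)   # a later stage: same bytes, passes

    artifact.write_bytes(b"binary built again")  # simulate a rebuild
    try:
        verify_artifact(artifact, recorded)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Each pipeline stage would call something like `verify_artifact` before using the binary, so a rebuilt or altered file is rejected instead of silently deployed.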

Related

How does CI affect semantic versioning?

The Continuous Delivery book recommends keeping everything, including CI scripts, in version control. Current CI systems like GitLab CI already follow this rule of thumb and look for CI scripts in the same codebase.
On the other hand, we version our codebase (and its built artifacts) whenever it changes. We follow semantic versioning for that: we increment the patch field for bug fixes, the minor field for non-breaking features, and so on.
And we make sure the version is incremented between commits by checking it in the CI.
But there are commits that only change the CI scripts, i.e. adding an analysis job, optimizing another, etc.
My question, after this long boring preface, is: what is the best practice for versioning such changes to the CI? They can affect the final built artifact (e.g. changing a build flag in the CI job for optimization).
Is it ok to increment the version in this case?
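For reference, the CI check mentioned above (the version must increase between commits) could look roughly like this. It is a simplified sketch, not the asker's actual script: it only handles plain MAJOR.MINOR.PATCH strings and ignores pre-release and build-metadata suffixes, and the function names are made up for the example.

```python
import re

# Only the MAJOR.MINOR.PATCH core is handled (a simplifying assumption).
SEMVER_CORE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(text: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    match = SEMVER_CORE.match(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_incremented(previous: str, current: str) -> bool:
    """True when `current` is strictly newer than `previous`."""
    # Tuples compare field by field, so 1.10.0 correctly beats 1.9.0.
    return parse_version(current) > parse_version(previous)


print(is_incremented("1.9.0", "1.10.0"))
```

Comparing integer tuples rather than strings matters here, because a plain string comparison would rank "1.9.0" above "1.10.0".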
Git is a revision control system. Every time you commit something to a git repo, it labels the content of the repo with a content hash value that represents that version of the repo. Semantic versioning of a git repo's content is redundant and pointless. The whole point of SemVer is to provide a means for producers to communicate risk to consumers. In other words, semantic versioning is intended for build product labeling, not the bits that go into producing the build.
If you attempt to apply SemVer semantics to the repo, you are labeling the product inputs, not the product itself. You should not apply a SemVer string until after all unit/regression/acceptance tests have been performed. How else can you have any certainty whether the code/build-script changes have broken anything?
Pre-build labeling cannot work. Build processes that are capable of reproducing the exact same output twice in a row are extremely rare, if any exist at all. It is a violation of best practice to have multiple APIs/packages in the world with the same SemVer string attached to them. If you label the repo content and then forward that label to the build output, every run of the build produces a package with different content under the same label. There will always be some risk that more than one of those outputs will be released into the wild. Many security-conscious consumers pay close attention to the content hash of the packages they consume. Detecting that a particular producer has released multiple package hashes without bumping the version number will raise red flags and lead to mistrust of that producer's internal processes.
This is a very deep topic that can't be fully covered here. Other issues to consider are OS/Compiler/Tool chain updates. Will you also be committing the entire build tool chain to the same repo? This is an untenable approach, full of hazards I cannot fully enumerate, without taking a few months off work to document them.
Best practice:
Use semantic commit messages that clearly state the developer's intent.
Validate build outputs prior to packaging/labeling.
Always keep humans in the loop, for non-prerelease publications.
Just for clarity, let me add that maintaining build scripts and tool manifests in the repo is considered a best practice. It ties the versions of your scripts and tools to the versions of the code you are building. Git does this job quite well by creating a commit hash that encompasses the state of the entire repo (minus the tags, if I recall correctly). But there will eventually be issues with older versions of tools being withdrawn from file shares/feeds, particularly when they are found to create security vulnerabilities.
It will sometimes be the case that older versions of your products cannot be reproduced using the earlier build process. Checking in the binaries is often promoted as a fix for this issue, but I would argue that it's an anti-pattern. Binaries you are likely never going to want or need in the future should not be stored in your repo. They just clog everything up.
Consider using an alternate archival system. Maintaining a separate archive of older tools isn't a bad idea, but you will often find that you simply can't run them on current hardware and OSes without significantly reconfiguring the build machine(s) and re-introducing well-known security risks. You should frequently prune such an archive based on the latest known risks, weighing them against the cost of the additional work you would face if the day ever comes that you need to build from a really old commit hash.
It is better to maintain an up-to-date build system that can build all of your code base, back to some reasonable point in its history. That point is usually the oldest bits that you are willing to actively support with bug fixes.
These days I'm using the SemVer-compatible HeadVer (https://github.com/line/HeadVer) and I'm happy with it.
It is very CI-friendly thanks to its automatic incremental versioning, yet it can still announce breaking changes because the major version number is set manually.

Using CI to Build Interdependent Projects in Isolation

So, I have an interesting situation. I have a group that is interested in using CI to help catch developer errors. Great - we are all for that. The problem I am having wrapping my head around things is that they want to build interdependent projects isolated from one another.
For example, we have a solution for our common/shared libraries that contains multiple projects, some of which depend on others in the solution. I would expect that if someone submits a change to one of the projects in the solution, the CI server would try to build the solution. After all, the solution is aware of dependencies and will build things, optimized, in the correct order.
Instead, they have broken out each project and are attempting to build them independently, without regard for dependencies. This causes other errors because dependent DLLs might not yet exist(!) and a possible chicken-and-egg problem if related changes were made in two or more projects.
This results in a lot of emails about broken builds (one for each project), even if the last set of changes did not really break anything! When I raised this issue, I was told that this was a known issue and the CI server would "rebuild things a few times to work out the dependencies!"
This sounds like a bass-ackwards way to do CI builds. But maybe that is because I am older and ignorant of newer trends. So, does anyone know of a CI setup that builds interdependent projects independently? And for what good reason?
Oh, and they are expecting us to use the build outputs from the CI process, and each built DLL gets a potentially different version number. So we no longer have a single version number for the output of all related DLLs. The best we can ascertain is that something was built on a specific calendar day.
I seem to be unable to convince them that this is A Bad Thing.
So, what am I missing here?
Thanks!
Peace!
I second your opinion.
You risk spending more time dealing with the unnecessary noise than with the actual issues. And repeating portions of the CI verification pipeline only extends the overall CI execution time, which goes against the CI goal of reducing the feedback loop.
If anything you should try to bring as many dependent projects as possible (ideally all of them) under the same CI umbrella, to maximize the benefit of fast feedback of breakages/regressions on any part of the system as a whole.
Personally I also advocate using a gating CI system (based on pre-commit verifications) to prevent regressions rather than just detecting them and relying on human intervention for repairs.
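The dependency-aware build order that the question expects from the solution can be sketched with a topological sort. The project names below are hypothetical, and this only shows the ordering idea, not how any particular CI server or solution file implements it.

```python
from graphlib import TopologicalSorter

# Hypothetical projects: each key depends on the projects in its set.
dependencies = {
    "Common.Core": set(),
    "Common.Data": {"Common.Core"},
    "Common.Web": {"Common.Core", "Common.Data"},
}

# static_order() yields every project after all of its dependencies.
build_order = list(TopologicalSorter(dependencies).static_order())
print(build_order)
```

Building in this order once avoids the "rebuild things a few times" approach, because each DLL already exists by the time a dependent project needs it.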

What are the consequences of always using Maven Snapshots?

I work with a small team that manages a large number of very small applications (~100 portlets). Each portlet has its own git repository. In some code I was reviewing today, someone made a small edit and then updated their pom.xml version from 1.88-SNAPSHOT to 1.89-SNAPSHOT. I added a comment asking if this is the best way to do releases, but I don't really know the negative consequences of doing this.
Why not do this? I know snapshots are not supposed to be releases, but why not? What are the consequences of using only snapshots? I know maven will not cache snapshots the same as non-snapshots, and so it may download the artifact every time, but let's pretend the caching doesn't matter. From a release-management perspective, why is using a SNAPSHOT version every time and just bumping the number a bad idea?
UPDATE:
Each of these projects results in a war file that will never be available on a maven repo outside of our team, so there are no downstream users.
The main reason for not wanting to do this is that the whole Maven ecosystem relies on a specific definition of what a snapshot version is. And this definition is not the one you're using in your question: a snapshot is only supposed to represent a version currently in active development, and it is not supposed to be a stable version. The consequence is that a lot of the tools built around Maven assume this definition by default:
The maven-release-plugin will not let you prepare a release with a snapshot version as the released version. So you'll need to resort to tagging by hand in your version control, or write your own scripts. This also means that the users of those libraries won't be able to use this plugin with its default configuration; they'll need to set allowTimestampedSnapshots.
The versions-maven-plugin, which can be used to automatically update to the latest release version, won't work properly either, so your users won't be able to use it without configuration pain.
Repository managers, like Artifactory or Nexus, come with a built-in, clear distinction between repositories hosting snapshot dependencies and repositories hosting release dependencies. For example, if you use a shared company-wide Nexus, it could be configured to purge old snapshots, which would break things for you. Imagine someone depends on 1.88-SNAPSHOT and it is completely removed: you'll have to go back in time and redeploy it, until the next removal. Also, certain Artifactory internal repositories can be configured not to accept any snapshots, so you won't be able to deploy there; the users will be forced, again, to add more repository configuration to point at repositories that do allow snapshots, which they may not want to do.
Maven is about convention over configuration, meaning that all Maven projects should try to share the same semantics (directory layout, versioning...). New developers who join your project will be confused and lose time trying to understand why your project is built the way it is.
In the end, doing this will just cause more pain for the users and will not simplify a single thing for you. You could probably make it somewhat work, but when something breaks (because of company policy, or some other future change), don't act surprised...
Tunaki gave a lot of reasonable points why you break Maven best practices, and I fully support that view. But even if you don't care about "conventions of other companies", there are reasons:
Unless you are doing CI and consider every build a potential release, you need to distinguish between versions that should go to production and those that are just for testing. If everything is a SNAPSHOT, this is hard to do.
If someone (accidentally) deploys a second 1.88-SNAPSHOT, it will be the new 1.88-SNAPSHOT, hiding the old one (which is available by a concrete timestamp, but this is messy). Release versions cannot be deployed twice.
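The difference in the last point can be shown with a toy in-memory model. This is not how Nexus or Artifactory are implemented, and the class and method names are invented for the sketch; it only mirrors the behaviour described above, where a snapshot version can be uploaded repeatedly while a release version is accepted once.

```python
class DeployRejected(Exception):
    """Raised when an upload breaks the repository's policy."""


class ArtifactStore:
    """Toy model of one release repository and one snapshot repository."""

    def __init__(self) -> None:
        self.releases: dict[str, bytes] = {}
        self.snapshots: dict[str, list[bytes]] = {}

    def deploy(self, version: str, payload: bytes) -> str:
        """Store the payload and return the name of the repository used."""
        if version.endswith("-SNAPSHOT"):
            # Snapshots are mutable: the newest upload hides the older ones.
            self.snapshots.setdefault(version, []).append(payload)
            return "snapshots"
        if version in self.releases:
            raise DeployRejected(f"release {version} already exists")
        self.releases[version] = payload
        return "releases"

    def latest_snapshot(self, version: str) -> bytes:
        """Return what a consumer asking for this snapshot version would get."""
        return self.snapshots[version][-1]


store = ArtifactStore()
store.deploy("1.88-SNAPSHOT", b"first build")
store.deploy("1.88-SNAPSHOT", b"accidental second build")
print(store.latest_snapshot("1.88-SNAPSHOT"))
```

With release versions, the second `deploy` call for the same version fails loudly, which is exactly the protection you give up by shipping snapshots.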

is it bad form to have your continuous integration system commit to a repository

I have recently been charged with building out our "software infrastructure" and so I am putting together a continuous integration server.
After a build completes, would it be considered bad form for the CI system to check some of the artifacts it creates into a tag, so that they can be fetched easily later (or so that, if the build breaks, you can more easily recreate the problem)?
For the record we use SVN and BuildMaster (free edition) here.
This is more of a best practices question rather than a how-to question. (It is pretty easy to do with BuildMaster)
Seth
If you believe this approach would be beneficial to you, go ahead and do it. As long as you maintain a clear trace of what source code was used to build each artifact, you'll be fine.
You should keep this artifact repository separated from the source code repository.
It is however a little odd to use a source code repository for this - these are typically used for things that will change, something your artifacts most definitely should not.
Source code repositories are also often used in a context where you want to check out "everything", for example the entire trunk. With artifacts you are typically looking for a specific version, and checking out all of them would only be done when exporting them to some other medium.
There are several artifact repositories specialized for this, for example Artifactory or Apache Archiva, but a properly backed-up file server with thought-through access settings might be a simple and good-enough solution.
I would say it's a smell to check in binaries as a tag. Your build artifacts should be associated with a particular build version in your build system, and that build should be associated with a particular checkin. You should be able to recreate the exact source code from that information. If what you're looking for is a one-stop-function to open the precise source-code revision that generated the broken build, I'd suggest that you invest some time into building a Powershell module that will do that for you.
Something with a signature like:
OpenBuild -projectName "some project name" -buildNumber "some build number"
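The lookup behind such a command could be as simple as an index from (project, build number) to the revision that was built. Below is a minimal sketch in Python rather than PowerShell; the record fields, class names, and sample values are all assumptions, and a real build system would persist this instead of keeping it in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRecord:
    """What the build system remembers about one build (hypothetical fields)."""
    project_name: str
    build_number: int
    revision: int  # source control revision the build was made from


class BuildIndex:
    """Maps (project, build number) to the record of that build."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, int], BuildRecord] = {}

    def register(self, record: BuildRecord) -> None:
        key = (record.project_name, record.build_number)
        if key in self._records:
            raise ValueError(f"build {key} is already registered")
        self._records[key] = record

    def revision_for(self, project_name: str, build_number: int) -> int:
        """Return the revision to check out in order to reproduce a build."""
        return self._records[(project_name, build_number)].revision


index = BuildIndex()
index.register(BuildRecord("some project name", 128, revision=4711))
print(index.revision_for("some project name", 128))
```

With the revision in hand, opening the matching source is an ordinary checkout, and no binaries need to live in the source repository.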

Benefits of CI for highly modularized projects

There has been some discussion about abandoning our CI system (Hudson, FWIW) because our projects are somewhat segmented. Without revealing too much, you can think of each project as similar to a web site project: it has dependencies, its own unit tests, etc.
It seems like one of the major benefits of CI is to make sure that each component of a project works together, but aside from project inheritance most of our projects are standalone and unit tested fairly well.
Given what I have explained here (the oddity in our project organization); can anyone explain any benefits of CI for segmented\modular\many projects?
So far as I can tell, this is the only good reason I've found:
“Bugs are also cumulative. The more bugs you have, the harder it is to remove each one. This is partly because you get bug interactions, where failures show as the result of multiple faults - making each fault harder to find. It's also psychological - people have less energy to find and get rid of bugs when there are many of them - a phenomenon that the Pragmatic Programmers call the Broken Windows syndrome.”
From here: http://martinfowler.com/articles/continuousIntegration.html#BenefitsOfContinuousIntegration
I would use Hudson for the following reasons:
Ensuring that your projects build/compile properly.
Building jobs dependent on the build success of other jobs.
Ensuring that your code adheres to agreed-upon coding standards.
Running unit tests.
Notifying development team of any issues found.
If the number of projects steadily increases, you will find the need to be able to manage each one effectively, especially considering the above reasons for doing so.
In your situation, you can benefit from CI in (at least) these two ways:
You can let the CI server run certain larger test suites automatically after each subversion/... check-in, especially those which test the interaction of different modules, hence the name continuous integration. This takes the maintenance work and waiting time away from the developers when they consider a check-in. Some CI servers (e.g. Hudson) can also be configured to automatically build modules when a module they depend on is built. This way you can automatically test whether dependent modules are compatible with the new version of the changed one.
You can let the CI server publish the new artifacts to the repository of a dependency resolver (e.g., Ivy, Maven). This way, the various modules can automatically download the latest (stable) revisions of the modules they depend on. Combine this point with the previous one and imagine the possibilities (!!!).
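To make the first point concrete, here is a small sketch of how a CI server might decide which modules to rebuild after a change. The module names and the function are hypothetical and this is not Hudson's actual mechanism; it simply walks the dependency graph in reverse.

```python
from collections import deque


def downstream_of(changed: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return every module that directly or transitively depends on `changed`."""
    # Invert the graph: module -> modules that depend on it.
    dependents: dict[str, set[str]] = {}
    for module, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, set()).add(module)

    # Breadth-first walk over the inverted graph, starting at the change.
    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, set()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical modules: each key depends on the modules in its set.
modules = {
    "core": set(),
    "auth": {"core"},
    "site": {"auth"},
    "docs": set(),
}
print(sorted(downstream_of("core", modules)))
```

A check-in to `core` would then trigger builds and tests for `auth` and `site`, while the unrelated `docs` module is left alone.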

Resources