I'm dealing with an enormous amount of data (say, video) and most integration tests require at least a decent subset of this data.
These test files (subsets) can range from 200MB to 2GB.
Where would be a good place to put these files? Ideally they would not go directly into our version control system because people shouldn't have to download 5GB+ of test data every time they want to check out the project.
The test data needs to be updated by Jenkins whenever a schema change occurs (we already have this part figured out), so either Maven or SVN would need to download the latest version whenever anybody wants to run the integration tests.
It would be great if this could happen on demand, since we never run all the tests at once locally (e.g., if we are running TestX, download only the files required for that test before running it).
Does anybody have any suggestion(s) on how to approach this?
Edit -- For the sake of simplicity let's say that the test files are incompressible.
In this case I would set up a file server share that contains all the test data in a nicely organized way, and let each test download the test data it needs itself. The advantage is that you can update the test data in the central place without updating the tests themselves; the next time the tests run, the new test data will be downloaded.
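For example, each test could fetch what it needs through a small helper. A minimal sketch (the server URL, cache location, and helper name are all hypothetical):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    /** Fetches a named test file from the central share, caching it locally. */
    public final class TestData {
        // Hypothetical share URL; point this at your own file server.
        private static final String BASE_URL = "http://fileserver.example.com/testdata/";
        private static final Path CACHE_DIR =
                Path.of(System.getProperty("java.io.tmpdir"), "testdata-cache");

        public static Path fetch(String name) throws IOException {
            Files.createDirectories(CACHE_DIR);
            Path local = CACHE_DIR.resolve(name);
            if (Files.notExists(local)) { // download on demand, only when a test asks for it
                try (InputStream in = URI.create(BASE_URL + name).toURL().openStream()) {
                    Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
                }
            }
            return local;
        }
    }

A test for TestX would then call TestData.fetch("testx-subset.bin") in its setup, so only the files that test actually needs are downloaded. The cache check here is deliberately naive; since Jenkins updates the data on schema changes, you would want to compare a checksum or timestamp against the server rather than trusting any local copy.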
If you need versioning, you could use a repository manager like Nexus instead of a simple file share. If you need auditability, I would suggest a version control system like Subversion. However, make sure that you use a separate repo just for your test data, so you can easily clean it out by replacing it with an empty repo that gets only the newest test data loaded.
Pact merges pacts at the file level. This is great for merging pacts from multiple tests, but not so great when you want to modify and re-run a test without cleaning the target/pacts folder.
The default JUnit run config in IntelliJ doesn't clean the target folder before running the tests. I know I can run mvn clean or remove the files manually, but this means anyone else who runs these tests locally needs to know to run them in a specific way.
I want to merge pacts from multiple tests, so I don't want to turn off merging.
I tried implementing a before method that deletes files from the pact folder if they exist, but it was janky.
I'm considering setting the pact folder to a temporary directory that removes itself after the tests are run, but that might interfere with pushing new pacts to the broker, and I don't want to remove the folder too soon/often and end up with missing pacts. Also it's useful to be able to see the files at the end, so auto-removing them isn't ideal.
Is there a nice way to stop old pacts merging with new ones, without relying on people to just know they need to remove old pact files before running a modified test?
Why is it an actual problem for you? As in, yes, the pact file is temporarily bigger than it should be, but what is the actual impact?
You shouldn't be publishing from your local machine anyway; that is a CI concern (I usually enforce this by not providing write credentials to local environments). So if all you need is to be able to rerun a unit test, I wouldn't worry.
Alternatively, if you are all using the same IDE, you could create an IDE-specific run configuration that cleans the directory before any target/test is run, and check it in to the repo.
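If you do want something in the tests themselves rather than in the IDE, a less janky shape for the "delete before run" idea is a JUnit 5 hook that removes only the pact files this test class owns. A sketch, assuming the default target/pacts output directory and a hypothetical my-consumer-my-provider file name:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;
    import org.junit.jupiter.api.BeforeAll;

    class ConsumerPactTest {

        // Remove pact files left over from previous runs of this class, so a
        // modified test cannot merge its new interactions with stale ones.
        @BeforeAll
        static void cleanStalePacts() throws IOException {
            Path pactDir = Path.of("target", "pacts");
            if (Files.isDirectory(pactDir)) {
                try (Stream<Path> files = Files.list(pactDir)) {
                    files.filter(p -> p.getFileName().toString().startsWith("my-consumer-my-provider"))
                         .forEach(p -> p.toFile().delete());
                }
            }
        }

        // ... pact tests for the my-consumer / my-provider pair go here ...
    }

The caveat is that if several test classes contribute interactions to the same consumer-provider pair, a per-class hook like this defeats the merging you want to keep, so it only fits when one class owns the pair.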
I don't know how to collect data from each build machine on CI. (I use TeamCity for CI, and this is my first time using CI on my own.)
After building the code and running the .exe file, an output file is generated. It is a .csv file, less than 1 KB in size, and very simple. I want to collect the data in one place and do some statistics.
Building and running the .exe file works fine. However, I don't know what the next step should be. I have two ideas.
(Idea 1) Set up a log database server (e.g. Elasticsearch with Kibana) and send the output to it. However, that seems like overkill.
(Idea 2) Create a batch file and just copy the log somewhere.
However, I don't know what the usual way is to collect data with CI. I guess there is a better solution. Is there any way to collect the data by using CI?
I can suggest using build artifacts: you can configure your builds so that they produce files and make them available to users of TeamCity. Then you can download and analyze them as you need. Given that the files are pretty small, I think this is an ideal fit.
If you need to collect the artifacts from every build, you can configure another build which runs a script (Python, for example) that uses the TeamCity REST API to collect the artifacts from the specific builds, zip them up, and produce a complete set of your files.
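The language of that script doesn't matter much. As an illustration, here is a minimal Java sketch that pulls one CSV artifact from the latest successful build over the REST API; the server URL, build configuration ID, artifact name, and the use of guest auth are all assumptions to adapt:

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    /** Pulls one CSV artifact from the latest successful run of a build configuration. */
    public final class CollectArtifacts {
        public static void main(String[] args) throws IOException, InterruptedException {
            // All placeholder values; adapt to your own TeamCity instance.
            String server = "https://teamcity.example.com";
            String buildType = "MyProject_Build";   // build configuration ID
            String artifact = "stats.csv";          // artifact path as published

            // Build locator: last successful build of the configuration, guest auth.
            String url = server + "/guestAuth/app/rest/builds/buildType:(id:" + buildType
                    + "),status:SUCCESS/artifacts/content/" + artifact;

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<Path> response =
                    client.send(request, HttpResponse.BodyHandlers.ofFile(Path.of(artifact)));
            System.out.println("Saved " + artifact + " (HTTP " + response.statusCode() + ")");
        }
    }

To gather the file from every build rather than just the latest, you would first list build IDs via /app/rest/builds?locator=... and loop over them, then concatenate the CSVs for your statistics.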
As an example, you can check some builds on the JetBrains test server: just select a finished build and navigate to the Artifacts tab.
Please ask more questions if my answer is not clear enough.
We have a VSTS build setup. Currently we have a single repository hosting multiple services, with a build definition per service, each triggered via a trigger pattern only when its own service is touched.
Now the issue is that, because of the single repository, each build definition's Get Sources step downloads the whole repo, and we also do a clean each time.
I have been searching to see if there is a solution like the trigger pattern, where we can set a pattern so that only part of the repository is downloaded, in order to reduce build/download time.
A workaround might be to not do a clean each time, or to split into multiple repositories. At the moment we would like to avoid the latter.
Let me hear if anyone knows of a good solution.
There is no way to specify the files to be downloaded in the Get Sources step; VSTS will download the whole repo.
The workaround is to set the Clean option to false so that only modified and newly added files are updated in the Get Sources step.
I have recently been charged with building out our "software infrastructure" and so I am putting together a continuous integration server.
After a build completes, would it be considered bad form for the CI system to check some of the artifacts it creates into a tag so that they can be fetched easily later (or, if the build breaks, so that you can more easily recreate the problem)?
For the record we use SVN and BuildMaster (free edition) here.
This is more of a best practices question rather than a how-to question. (It is pretty easy to do with BuildMaster)
Seth
If you believe this approach would be beneficial to you, go ahead and do it. As long as you maintain a clear trace of what source code was used to build each artifact, you'll be fine.
You should keep this artifact repository separated from the source code repository.
It is however a little odd to use a source code repository for this - these are typically used for things that will change, something your artifacts most definitely should not.
Source code repositories are also often used in a context where you want to check out "everything", for example the entire trunk. With artifacts you are typically looking for a specific version, and checking out all of them would only be done if exporting them to some other medium.
There are several artifact repositories specialized for this, for example Artifactory or Apache Archiva, but a properly backed-up file server with thought-through access settings might be a simple and good-enough solution.
I would say it's a smell to check in binaries as a tag. Your build artifacts should be associated with a particular build version in your build system, and that build should be associated with a particular checkin. You should be able to recreate the exact source code from that information. If what you're looking for is a one-stop function to open the precise source-code revision that generated the broken build, I'd suggest that you invest some time into building a PowerShell module that will do that for you.
Something with a signature like:
OpenBuild -projectName "some project name" -buildNumber "some build number"
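The same one-stop function can be sketched in any language, not just PowerShell. For instance, assuming your build system can tell you which SVN revision produced a given build number, the core of it is just a pinned checkout; the repo URL and argument handling here are hypothetical:

    import java.io.IOException;

    /** Checks out the exact source revision behind a given build into a scratch directory. */
    public final class OpenBuild {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Hypothetical: in practice the revision would come from your build
            // system's record of which checkin produced the build.
            String repoUrl = "https://svn.example.com/repo/trunk";
            long revision = Long.parseLong(args[0]);
            String target = "build-r" + revision;

            // "svn checkout -r <rev>" pins the working copy to that revision.
            Process p = new ProcessBuilder(
                    "svn", "checkout", "-r", Long.toString(revision), repoUrl, target)
                    .inheritIO()
                    .start();
            System.exit(p.waitFor());
        }
    }

With a build-number-to-revision lookup added in front (the build system records which checkin each build came from), this gives the "open the source behind build N" function described above.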
I have a project repository on a StarTeam server.
I need to take regular backups of it.
How can I achieve this?
The StarTeam backup steps are given in Appendix C of "The StarTeam Administrator's Guide.pdf".
It depends on what you mean by backing up the project. If you mean backing up the entire repository, then StarTeam makes this really easy: you just need a snapshot of the DB and a full copy of the repository files (the full steps are documented). However, if you mean backing up a specific Project in the repository, and ONLY that Project, with all history intact, then this is not currently possible -- or at least it is a major challenge.
StarTeam used to have the ability to import/export projects, but support and development of that tool was discontinued years ago. Backing up a single Project independently of the rest of the server is still possible, though, and useful in the case where you want to split it off into a separately managed repository. Here is how to do that:
1. Create a duplicate repository, including all of the repository files.
2. Delete everything from the clone except for the Project(s) that you want to split off. Note that in StarTeam 2011 the Project Delete was broken, so you may need to do this with a direct SQL query that marks the projects/views as deleted. Contact Support if you run into problems deleting manually, especially if you have a large repository.
3. Once your clone has been pruned of unnecessary projects, run the Online Purge tool until all projects and their respective files have been removed from the DB and the Vault.
4. You can now change what you need to change on the new repository, such as the users, groups, security, etc., without affecting the first repository.
5. Once you have validated that the new repository is working properly, you can run a similar process on the first repository to get rid of the projects that were split off.
Another potential use for this is if you have reached end of life for a project and you want to keep it offline and backed up, but restorable with full history on demand (for regulatory purposes, etc.), while freeing it from the active repository so other projects run faster. This is probably best done in batches of projects, though, as the process is currently quite labor-intensive.