We have a large number of tasks (~30) kicked off by SCDF on PCF; however, we are running into disk space issues with SCDF. The issue appears to be due to SCDF downloading artifacts each time a task is invoked.
The artifacts in our case are downloaded from a REST endpoint https://service/{artifact-name-version.jar} (which in turn serves them from an S3 repository).
Every time a task is invoked, it appears that SCDF downloads the artifact (to the ~tmp/spring-cloud-deployer directory) and verifies the SHA-1 hash to make sure it's the latest before it launches the task on PCF.
The downloaded artifacts never get cleaned up.
It's not desirable to download artifacts each time and fill up disk space in ~tmp/ of the SCDF instance on PCF.
Is there a way to tell SCDF not to download the artifact if it already exists?
Also, can someone please explain the mechanism of artifact download, SHA-1 hash comparison, and task launching (and the various options around it)?
Thanks!
SCDF downloads the artifacts on the server side for the following reasons:
1) Metadata (application properties) retrieval: if you have an explicit metadata resource, only that is downloaded.
2) The corresponding deployer (local, CF) eventually downloads the artifact before it sends the deployment/launch request.
The hash value is used to create a unique temp file when the artifact is downloaded.
Is there a way to tell SCDF not to download the artifact if it already exists?
HTTP-based artifacts (or any explicit URL-based artifacts other than Maven and Docker) are always downloaded, because the resource at a given URL can be replaced with some other resource, and we don't want to use the cache in that case.
Also, we recently deprecated the cache cleanup mechanism, as it wasn't being used effectively.
If your use case requires this cache cleanup feature (because this disk space limitation means you can't keep multiple cached artifacts), please create a GitHub request here.
We were also considering removing the HTTP-based artifact after it is deployed/launched. It looks like it is worth revisiting that now.
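One related option, since Maven and Docker URIs are the exception above: if the task artifacts can also be published to a Maven repository that the SCDF server can resolve, registering them with a maven:// URI lets the Maven resolver reuse its local repository cache instead of pulling the jar to a new temp file on every launch. A hypothetical registration in the SCDF shell (coordinates made up) would be:

    app register --name my-task --type task --uri maven://com.example:my-task:1.0.0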
In the AWS CodeBuild pipeline I got this error during the build-image task:
could not download https://repo.spring.io/release/org/springframework/cloud/spring-cloud-bindings/1.10.0/spring-cloud-bindings-1.10.0.jar
The same build failed on my PC, and the artifact spring-cloud-bindings-1.10.0.jar didn't exist anymore on repo.spring.io.
Today the build-image task is working and spring-cloud-bindings-1.10.0.jar is available from the URL https://repo.spring.io/release/org/springframework/cloud/spring-cloud-bindings/1.10.0/spring-cloud-bindings-1.10.0.jar.
The cause of the problem was a temporary unavailability of the Spring repository.
This is evidence that this task doesn't use the Maven repository cache mechanism.
For future readers having the same or similar issues: you might want to read the relevant article from the Spring team here.
Basically, the summary of the article is that you should stop retrieving dependencies from repo.spring.io and switch to Maven Central instead.
The reason is that repo.spring.io was sponsored in the past by JFrog, Inc., but as the situation has now changed, they are basically moving to other instances.
I wrote a pipeline to build my Java application with Maven. I have feature branches and a master branch in my Git repository, so I have to separate the Maven goals package and deploy. Therefore I created two jobs in my pipeline. The last job needs the job results from the first job.
I know that I have to cache the job results, but I don't want to
expose the job results to the GitLab UI
expose them to the next run of the pipeline
I tried the following solutions without success.
Using cache
I followed How to deploy Maven projects to Artifactory with GitLab CI/CD:
Caching the .m2/repository folder (where all the Maven files are stored), and the target folder (where our application will be created), is useful for speeding up the process by running all Maven phases in a sequential order, therefore, executing mvn test will automatically run mvn compile if necessary.
but this solution shares job results between pipelines; see Cache dependencies in GitLab CI/CD:
If caching is enabled, it’s shared between pipelines and jobs at the project level by default, starting from GitLab 9.0. Caches are not shared across projects.
and it also should not be used for passing results between stages of the same pipeline; see Cache vs artifacts:
Don’t use caching for passing artifacts between stages, as it is designed to store runtime dependencies needed to compile the project:
cache: For storing project dependencies
Caches are used to speed up runs of a given job in subsequent pipelines, by storing downloaded dependencies so that they don’t have to be fetched from the internet again (like npm packages, Go vendor packages, etc.) While the cache could be configured to pass intermediate build results between stages, this should be done with artifacts instead.
artifacts: Use for stage results that will be passed between stages.
Artifacts are files generated by a job which are stored and uploaded, and can then be fetched and used by jobs in later stages of the same pipeline. This data will not be available in different pipelines, but is available to be downloaded from the UI.
Using artifacts
This solution exposes the job results to the GitLab UI; see artifacts:
The artifacts will be sent to GitLab after the job finishes and will be available for download in the GitLab UI.
and there is no way to expire the artifacts right when the pipeline finishes; see artifacts:expire_in:
The value of expire_in is an elapsed time in seconds, unless a unit is provided.
Is there any way to cache job results only for the running pipeline?
There is no way to send build artifacts between jobs in GitLab that keeps them only as long as the pipeline is running. This is how GitLab has designed their CI solution.
The recommended way to send build artifacts between jobs in GitLab is to use artifacts. This feature always uploads the files to the GitLab instance, which they call the coordinator in this case. These files are available through the GitLab UI, as you write. For most cases this is a complete waste of space, but in rare cases it is very useful, as you can download the artifacts and check why your pipeline broke.
The artifacts are available for download by project members who are at least Reporters, but they can be viewed by everybody if the public pipelines setting is enabled. You can read more about permissions here.
To avoid filling up your hard disk or quotas, you should use expire_in. You could set it to just a few hours if you really don't want to waste space. I would not recommend this, though: if a job that depends on these artifacts fails and you retry it after the artifacts have expired, you will have to restart the whole pipeline. I usually set this to one week for intermediate build artifacts, as that often fits my needs.
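As a minimal sketch of that approach for a Maven setup like yours (the job names, the deploy script, and the one-week expiry below are placeholders, not taken from your pipeline):

    stages:
      - build
      - deploy

    package:
      stage: build
      script:
        - mvn package
      artifacts:
        paths:
          - target/            # hand the Maven build output to later stages
        expire_in: 1 week      # long enough to retry the deploy job without rebuilding

    deploy:
      stage: deploy
      script:
        - ./deploy.sh target/*.jar   # hypothetical deploy step; target/ is fetched automatically from the package job
      only:
        - master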
If you want to use caches for keeping build artifacts, maybe because your build artifacts are huge and you need to optimize for that, it should be possible to use CI_PIPELINE_ID as the key of the cache (I haven't tested this):
cache:
  key: ${CI_PIPELINE_ID}
  paths:
    - target/  # hypothetical path; list whatever build output the later job needs
The files in the cache should be stored where your runner is installed. If you make sure that all jobs that need these build artifacts are executed by runners that have access to this cache, it should work.
You could also try some of the other predefined environment variables as the key of your cache.
I have access to a private Nexus Repository and would like to speed up my CI builds, so I thought I could use the private repository to store and access my build cache. Is this a possibility or a dead end?
It works like a breeze.
Just create a "Raw" repository and give a user write permission for it.
This user is then used to fill the cache, and you can use another user or anonymous access to read from the cache.
I just tried it minutes ago.
Any web server that supports PUT for storing files and GET for retrieving the same files should be fine with the default HttpBuildCache implementation.
You can even provide your own client-side implementation to use any remote service you want as a build cache.
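As a rough sketch, assuming a Nexus raw repository reachable at a made-up URL and a dedicated user with write permission (keep the password out of the script in practice), the default HttpBuildCache can be pointed at Nexus from settings.gradle like this:

    buildCache {
        remote(HttpBuildCache) {
            url = 'https://nexus.example.com/repository/gradle-build-cache/'  // hypothetical raw repository
            push = true                                                       // typically only CI pushes; developers can read with push = false
            credentials {
                username = 'cache-writer'                                     // the user that was given write permission
                password = System.getenv('NEXUS_CACHE_PASSWORD')              // assumed environment variable
            }
        }
    }

Builds then use the remote cache when run with --build-cache, or with org.gradle.caching=true set in gradle.properties.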
No.
Gradle's remote build cache is one of the selling points of Gradle Enterprise, so it's not something you can just plug in to another piece of software like Nexus.
There is, however, a Docker image that is designed to work with Gradle Enterprise. Maybe you could make use of that somehow.
But again, the remote build cache is a selling point of Gradle Enterprise and, as a result, is designed to work with Gradle Enterprise.
https://gradle.com/build-cache/
I have a Nexus 3 OSS (3.13.0) Docker container deployed in AWS, backed by an S3 blobstore. Our CI jobs are continuously uploading artifacts to this repo, and that worked just fine. However, of late, uploading Maven artifacts takes a long time and in some cases eventually fails.
It was originally version 3.12.0 and I thought upgrading might help, but it didn't. I also checked whether it had anything to do with connectivity or permissions to S3 and found nothing.
Update:
Switched to a file-based blobstore and the issue still persists, so we can at least rule out that it's specific to the S3 blobstore.
The repo size is greater than 20 GB, so I increased the heap allocation as recommended in the documentation, but it still did not help.
Any idea what might be happening?
Here's what I see in the logs on nexus3:
org.sonatype.nexus.blobstore.api.BlobStoreException: BlobId: null, Error uploading blob
at org.sonatype.nexus.blobstore.s3.internal.MultipartUploader.upload(MultipartUploader.java:98)
at org.sonatype.nexus.blobstore.s3.internal.S3BlobStore.lambda$0(S3BlobStore.java:220)
at org.sonatype.nexus.blobstore.s3.internal.S3BlobStore.create(S3BlobStore.java:257)
at org.sonatype.nexus.blobstore.s3.internal.S3BlobStore.create(S3BlobStore.java:217)
...
Caused by: org.eclipse.jetty.io.EofException: Early EOF
at org.eclipse.jetty.server.HttpInput$3.getError(HttpInput.java:1138)
... 122 common frames omitted
The solution was to set the right order of realms in the admin section. In my case, the LDAP realm was ordered before the local authentication and local authorizing realms, but users were actually connecting with a locally created user. So Nexus would try an LDAP lookup before a local lookup, which was causing the delay in the authentication mechanism. Once the order was changed to move the local realms above the LDAP realm, things got better and uploads were much faster.
Do you know a way to configure Nexus OSS so that it publishes the artifact repository to a remote server in a form that can be served statically, e.g. by Apache httpd? I'd like to use this static copy to serve only my own artifacts, so the Nexus server could actively trigger an update whenever something new is published.
Technically, I think it should be possible to create the metadata for the repo and store it in a static file, but I'm not sure about that. Any hints appreciated.
If there is another repo manager that can achieve this, that would be fine for me as well.
I clearly understand the advantages of using the repo manager directly, but due to IT rules I can run Nexus only internally, and it would be necessary to have these artifacts available in a (private) repo copy on the Internet as well.
A typical way to satisfy this IT requirement of only exposing known servers like Apache httpd is to set up Apache httpd as a reverse proxy, as documented here.
You can make that approach more restrictive by exposing only a specific repository, or better a repository group (so you can combine snapshots and releases), and by tying that to a specific user or a suitably restricted setup of the anonymous user that is used by default when no credentials are passed through.
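As an illustrative sketch only (the hostnames and repository group name are placeholders, the TLS directives are omitted, and mod_proxy plus mod_proxy_http are assumed to be loaded), the httpd side of such a restricted reverse proxy could be as small as:

    <VirtualHost *:443>
        ServerName repo.example.com
        # TLS configuration omitted

        ProxyRequests Off
        ProxyPreserveHost On

        # expose only a single repository group from the internal Nexus instance
        ProxyPass        /repository/public/ http://nexus.internal:8081/repository/public/
        ProxyPassReverse /repository/public/ http://nexus.internal:8081/repository/public/
    </VirtualHost>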
Also, if you need more help, feel free to contact us on the user mailing list or on HipChat.