Reuse a conda environment within the same gitlab CI pipeline? - caching

Motivation
My main goal is this: Within a pipeline, I would like to reuse as much as possible (i.e. not build the conda environment multiple times, if all jobs share the same environment).
In my project, I use conda as dependency manager and GitLab CI/CD for continuous integration. For the sake of simplicity, let's say I have a build job and a test job. The most straightforward approach would be to create the conda environment from the environment.yml in every job and then do the actual work. This adds an overhead of several minutes to every job. It also seems wasteful to me, since I would like to build the environment once in the build job and then use it in my test job (especially when creating multiple jobs for different tests).
Research Results
The first thing I need to do is to set the CONDA_ENVS_PATH to somewhere in my project directory.
I've looked at gitlab's caching mechanism, but found that it only helps for the same job in repeated runs of the same pipeline, but not for different jobs of the same run within a pipeline.
I've also looked at gitlab's artifacts mechanism, but found that, due to the upload and download of those, they don't reduce run time significantly (basically I only save time by not downloading many small packages and not having to compile them again, but lose time by compressing and decompressing them).
I've also tried to make use of the GIT_CLEAN_FLAGS by setting them to none in my test job. That way, the conda environment is not deleted when getting the latest data from git. This does cause a serious speedup in my pipelines, but it does not work all the time. Some jobs fail, not finding the conda environment. A simple rerun does magically work however. Of course, in a CI / CD setting, this nondeterminism is not practical.
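For reference, the CONDA_ENVS_PATH plus GIT_CLEAN_FLAGS attempt described above could look roughly like the sketch below. The job names, the .conda/envs location and the environment name ci-env are placeholders, not the original configuration, and pytest is assumed to be listed in environment.yml. The sketch only works when both jobs land on the same runner and the same build directory, which is exactly the part that is not guaranteed.

```yaml
image: continuumio/miniconda3:latest

variables:
  # Keep conda environments inside the project directory (placeholder path)
  CONDA_ENVS_PATH: "${CI_PROJECT_DIR}/.conda/envs"

stages:
  - build
  - test

build:
  stage: build
  script:
    # Create the environment once; "ci-env" is a placeholder name
    - conda env create -n ci-env -f environment.yml

test:
  stage: test
  variables:
    # Skip "git clean" so untracked files such as .conda/ survive the checkout
    GIT_CLEAN_FLAGS: none
  script:
    # Run the tests inside the environment created by the build job
    - conda run -n ci-env pytest tests
```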
As a workaround to the original question, we've come up with an intermediate solution. We introduce a docker image, holding our custom environment, through a minimal Dockerfile and a couple of changes to our .gitlab-ci.yml (see below for an example). By only executing the job that builds our custom docker image when the Dockerfile or the environment.yml changed, we save valuable time on each run. At the same time, we keep the full flexibility in our environment definition and can adjust it exactly how we usually would: by changing the environment.yml.
Question
All solutions tried so far are not really satisfactory. Thus my question is: How can my test job reuse the same conda environment as my build job in gitlab-ci?
In case someone else would like to use a similar setup: Here is my current approach:
# Dockerfile
FROM continuumio/miniconda3:latest
COPY environment.yml .
RUN conda env create -f environment.yml
ENTRYPOINT [""]
# .gitlab-ci.yml
# Use the latest version of this project's docker file
# This will be the default image for all jobs unless specified otherwise
image: $CI_REGISTRY_IMAGE:latest
# Change cache directories to be inside the project directory since we can
# only cache local items.
variables:
  PRE_COMMIT_HOME: "${CI_PROJECT_DIR}/.cache/pre-commit"
stages:
  - build
  - test
# Make conda environment available to all jobs
# This expects the conda environment to have the same name as the gitlab project path
# Avoid dashes and other non-alphabetical characters
default:
  before_script:
    - source activate "${CI_PROJECT_NAME}"
# Build the docker image including the correct conda environment for subsequent jobs
# This assumes a docker image registry being configured for your gitlab instance
dockerimage:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  rules:
    - changes:
        - Dockerfile
        - environment.yml
  before_script: [ ]
  script:
    - mkdir -p /kaniko/.docker
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    - /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $CI_REGISTRY_IMAGE:latest
# Run pytest
pytest:
  stage: test
  script:
    - conda install pytest-cov
    - pytest tests --cov=src
Edit: I replaced the example code for using GIT_CLEAN_FLAGS with our most recent approach: using a custom docker image.
Disclaimer: I saw this and this question, but they are both dated, don't have a satisfying answer and I only found them after writing this question, so I hope my additional question increases discoverability of the topic.

Related

Share a file between two workflows in CircleCI

In our repo build and deploy are two different workflows.
In build we call lerna to check for changed packages and save the output in a file saved to the current workspace.
check_changes:
  working_directory: ~/project
  executor: node
  steps:
    - checkout
    - attach_workspace:
        at: ~/project
    - run:
        command: npx lerna changed > changed.tmp
    - persist_to_workspace:
        root: ./
        paths:
          - changed.tmp
I'd like to pass the exact same file from build workflow to deploy workflow and access it in another job. How do I do that?
read_changes:
  working_directory: ~/project
  executor: node
  steps:
    - checkout
    - attach_workspace:
        at: ~/project
    - run:
        command: |
          echo 'Reading changed.tmp file'
          cat changed.tmp
According to this blog post
Unlike caching, workspaces are not shared between runs as they no
longer exists once a workflow is complete
it feels that caching would be the only option.
But according to the CircleCI documentation, my case doesn't fit their cache definitions:
Use the cache to store data that makes your job faster, but, in the
case of a cache miss or zero cache restore, the job still runs
successfully. For example, you might cache NPM package directories
(known as node_modules).
I think you can totally use caching here. Make sure you choose your key template(s) wisely.
The caveat to keep in mind is that (unlike the job level where you can use the requires key), there's no native way to sequentially execute workflows. Although you could consider using an orb for that; for example the roopakv/swissknife orb.
So you'll need to make sure that the job (that needs the file) in the deploy workflow, doesn't move to the restore_cache step until the save_cache in the other job has happened.
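A minimal sketch of that caching approach, assuming the node executor from the question is defined elsewhere in the config. The lerna-state directory and the lerna-changed- key prefix are made-up names for illustration; the key uses the built-in {{ .Revision }} template so that both workflows resolve to the same cache entry for a given commit. It does not solve the ordering problem described above: read_changes only finds the file if check_changes has already saved it.

```yaml
version: 2.1

jobs:
  check_changes:
    working_directory: ~/project
    executor: node
    steps:
      - checkout
      - run:
          command: |
            # Write the output into a dedicated directory (hypothetical name)
            mkdir -p lerna-state
            npx lerna changed > lerna-state/changed.tmp
      - save_cache:
          # One cache entry per commit
          key: lerna-changed-{{ .Revision }}
          paths:
            - lerna-state

  read_changes:
    working_directory: ~/project
    executor: node
    steps:
      - checkout
      - restore_cache:
          keys:
            - lerna-changed-{{ .Revision }}
      - run:
          command: |
            echo 'Reading changed.tmp file'
            cat lerna-state/changed.tmp
```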

Cypress binary is missing and Gitlab CI pipeline

I'm trying to integrate cypress testing into gitlab pipeline.
I've tried about 10 different configurations, which all fail. I've included what I think are the relevant portions of the gitlab.yml file, as well as the screenshot of the error on gitlab.
Thanks for any help
variables:
  GIT_SUBMODULE_STRATEGY: recursive
cache:
  paths:
    - src/ui/node_modules/
    - /root/.cache/Cypress/ # added this, also have tried src/ui/cypress/
build_ui:
  image: node:16.14.2
  stage: build
  script:
    - cd src/ui
    - yarn install --pure-lockfile --prefer-offline --cache-folder .yarn
ui_test:
  image: node:16.14.2
  stage: test
  needs: [build_ui]
  script:
    - cd src/ui
    - yarn run runCypressHeadless
Each job gets its own separate environment. Therefore, you need to install your dependencies in each job. Add your yarn install command to the ui_test job.
The reason why your cache: did not restore to the job from the previous stage is because caches are per job by default (e.g. caches are restored from previous pipelines that ran the same job). If you want subsequent jobs in the same pipeline to use the cache, set the cache:key: to something like $CI_COMMIT_SHA or use cache:key:files: to use a file key, like your lockfile(s).
Also, you can only cache paths in the workspace. So you won't be able to cache/restore /root/.cache/... -- instead you should change the cache location to somewhere in the workspace.
For additional reference, see: caching in GitLab CI and caching NodeJS dependencies.
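Put together, the suggestions above might look like the following sketch. The lockfile location src/ui/yarn.lock and the .cache/Cypress directory are assumptions, not taken from the question; CYPRESS_CACHE_FOLDER is the environment variable Cypress reads to decide where its binary is stored.

```yaml
variables:
  GIT_SUBMODULE_STRATEGY: recursive
  # Store the Cypress binary inside the project directory so it can be cached
  CYPRESS_CACHE_FOLDER: "$CI_PROJECT_DIR/.cache/Cypress"

cache:
  key:
    files:
      # Assumed lockfile location; the cache is rebuilt when this file changes
      - src/ui/yarn.lock
  paths:
    - src/ui/node_modules/
    - .cache/Cypress/

build_ui:
  image: node:16.14.2
  stage: build
  script:
    - cd src/ui
    - yarn install --pure-lockfile --prefer-offline --cache-folder .yarn

ui_test:
  image: node:16.14.2
  stage: test
  needs: [build_ui]
  script:
    - cd src/ui
    # Install again so the job still works on a cache miss
    - yarn install --pure-lockfile --prefer-offline --cache-folder .yarn
    - yarn run runCypressHeadless
```

With a warm cache the second yarn install should mostly be a no-op, while a cold cache simply falls back to a full install.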

Gitlab CI : how to cache node_modules from a prebuilt image?

The situation is this:
I'm running Cypress tests in a Gitlab CI (launched by vue-cli). To speed up the execution, I built a Docker image that contains the necessary dependencies.
How can I cache node_modules from the prebuilt image to use it in the test job?
Currently I'm using an awful (but working) solution:
testsE2e:
  image: path/to/prebuiltImg
  stage: tests
  script:
    - ln -s /node_modules/ /builds/path/to/prebuiltImg/node_modules
    - yarn test:e2e
    - yarn test:e2e:report
But I think there must be a cleaner way using the Gitlab CI cache.
I've been testing:
cacheE2eDeps:
  image: path/to/prebuiltImg
  stage: dependencies
  cache:
    key: e2eDeps
    paths:
      - node_modules/
  script:
    - find / -name node_modules # check that node_modules files are there
    - echo "Caching e2e test dependencies"
testsE2e:
  image: path/to/prebuiltImg
  stage: tests
  cache:
    key: e2eDeps
  script:
    - yarn test:e2e
    - yarn test:e2e:report
But the job cacheE2eDeps displays a "WARNING: node_modules/: no matching files" error.
How can I do this successfully? The Gitlab documentation doesn't really talk about caching from a prebuilt image...
The Dockerfile used to build the image:
FROM cypress/browsers:node13.8.0-chrome81-ff75
COPY . .
RUN yarn install
There is no documentation for caching data from prebuilt images, because it’s simply not done. The dependencies are already available in the image, so why cache them in the first place? It would only lead to unnecessary data duplication.
Also, you seem to operate under the impression that cache should be used to share data between jobs, but its primary use case is sharing data between different runs of the same job. Sharing data between jobs should be done using artifacts.
In your case you can use cache instead of prebuilt image, like so:
variables:
  CYPRESS_CACHE_FOLDER: "$CI_PROJECT_DIR/cache/Cypress"
testsE2e:
  image: cypress/browsers:node13.8.0-chrome81-ff75
  stage: tests
  cache:
    key: "e2eDeps"
    paths:
      - node_modules/
      - cache/Cypress/
  script:
    - yarn install
    - yarn test:e2e
    - yarn test:e2e:report
The first time the above job is run, it’ll install dependencies from scratch, but the next time it’ll fetch them from the runner cache. The caveat is that unless all runners that run this job share cache, each time you run it on a new runner it’ll install the dependencies from scratch.
Here’s the documentation about using yarn with GitLab CI.
Edit:
To elaborate on using cache vs artifacts: artifacts are meant both for storing job output (e.g. to manually download it later) and for passing results of one job to another one from a subsequent stage, while cache is meant to speed up job execution by preserving files that the job needs to download from the internet. See GitLab documentation for details.
Contents of node_modules directory obviously fit into the second category.

Codebuild Workflow with environment variables

I have a monolith github project that has multiple different applications that I'd like to integrate with an AWS Codebuild CI/CD workflow. My issue is that if I make a change to one project, I don't want to update the other. Essentially, I want to create a logical fork that deploys differently based on the files changed in a particular commit.
Basically my project repository looks like this:
- API
  - node_modules
  - package.json
  - dist
  - src
- REACTAPP
  - node_modules
  - package.json
  - dist
  - src
- scripts
  - 01_install.sh
  - 02_prebuild.sh
  - 03_build.sh
- .ebextensions
In terms of Deployment, my API project gets deployed to elastic beanstalk and my REACTAPP gets deployed as static files to S3. I've tried a few things but decided that the only viable approach is to manually perform this deploy step within my own 03_build.sh script - because there's no way to build this dynamically within Codebuild's Deploy step (I could be wrong).
Anyway, my issue is that I essentially need to create a decision tree to determine which project gets executed, so if I make a change to API and push, it doesn't automatically deploy REACTAPP to S3 unnecessarily (and vice versa).
I managed to get this working on localhost by updating environment variables at certain points in the build process and then reading them in separate steps. However this fails on Codedeploy because of permission issues i.e. I don't seem to be able to update env variables from within the CI process itself.
Explicitly, my buildconf.yml looks like this:
version: 0.2
env:
  variables:
    VARIABLES: 'here'
    AWS_ACCESS_KEY_ID: 'XXXX'
    AWS_SECRET_ACCESS_KEY: 'XXXX'
    AWS_REGION: 'eu-west-1'
    AWS_BUCKET: 'mybucket'
phases:
  install:
    commands:
      - sh ./scripts/01_install.sh
  pre_build:
    commands:
      - sh ./scripts/02_prebuild.sh
  build:
    commands:
      - sh ./scripts/03_build.sh
I'm running my own shell scripts to perform some logic and I'm trying to pass variables between scripts: install->prebuild->build
To give one example, here's the 01_install.sh where I diff each project version to determine whether it needs to be updated (excuse any minor errors in bash):
#!/bin/bash
# STAGE 1
# _______________________________________
# API PROJECT INSTALL
# Do if API version was changed in prepush (this is just a sample and I'll likely end up storing the version & previous version within the package.json):
# diff exits non-zero when the two files differ, so negate it to detect a change
if ! diff ./api/version.json ./api/old_version.json > /dev/null 2>&1; then
  echo "🤖 Installing dependencies in API folder..."
  cd ./api/ && npm install
  ## Set a variable to be used by the 02_prebuild.sh script
  TEST_API="true"
  export TEST_API
else
  echo "No change to API"
fi
# ______________________________________
# REACTAPP PROJECT INSTALL
# Do if REACTAPP version number has changed (similar to above):
...
Then in my next stage I read these variables to determine whether I should run tests on the project 02_prebuild.sh:
#!/bin/bash
# STAGE 2
# _________________________________
# API PROJECT PRE-BUILD
# Do if install was initiated
if [[ $TEST_API == "true" ]]; then
  echo "🤖 Run tests on API project..."
  cd ./api/ && npm run tests
  echo $TEST_API
  BUILD_API="true"
  export BUILD_API
else
  echo "Don't test API"
fi
# ________________________________
# TODO: Complete for REACTAPP, similar to above
...
In my final script I use the BUILD_API variable to build to the dist folder, then I deploy that to either Elastic Beanstalk (for API) or S3 (for REACTAPP).
When I run this locally it works, however when I run it on Codebuild I get a permissions failure presumably because my bash scripts cannot export ENV_VAR. I'm wondering either if anyone knows how to update ENV_VARIABLES from within the build process itself, or if anyone has a better approach to achieve my goals (conditional/ variable build process on Codebuild)
EDIT:
So an approach that I've managed to get working is instead of using Env variables, I'm creating new files with specific names using fs then reading the contents of the file to make logical decisions. I can access these files from each of the bash scripts so it works pretty elegantly with some automatic cleanup.
I won't edit the original question as it's still an issue and I'd like to know how/ if other people solved this. I'm still playing around with how to actually use the eb deploy and s3 cli commands within the build scripts as codebuild does not seem to come with the eb cli installed and my .ebextensions file does not seem to be honoured.
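The file-based approach from the edit above could be sketched directly in the buildspec, roughly as follows. The .build-flags directory and the marker file names are hypothetical, and the commands are inlined here instead of living in the three scripts. Marker files survive between phases because they sit in the source directory, whereas a variable exported inside a script started with sh is lost when that child process exits.

```yaml
version: 0.2
phases:
  install:
    commands:
      - mkdir -p .build-flags
      - |
        # Mark the API for testing when its version file changed
        if ! diff ./api/version.json ./api/old_version.json > /dev/null 2>&1; then
          (cd ./api && npm install)
          touch .build-flags/test_api
        fi
  pre_build:
    commands:
      - |
        # Only test when the install phase left a marker; mark for build on success
        if [ -f .build-flags/test_api ]; then
          (cd ./api && npm run tests) && touch .build-flags/build_api
        fi
  build:
    commands:
      - |
        # Build (and later deploy) only what passed the previous phase
        if [ -f .build-flags/build_api ]; then
          echo "Building API"
        fi
```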
Source control repos like Github can be configured to send a post event to an API endpoint when you push to a branch. You can consume this post request in lambda through API Gateway. This event data includes which files were modified with the commit. The lambda function can then process this event to figure out what to deploy. If you’re struggling with deploying to your servers from the codebuild container, you might want to try posting an artifact to s3 with an installable package and then have your server grab it from there.

Gitlab-Ci: How could I share data between jobs

I want to share a file between two jobs and modify it if there are changed files. The Python script compares the cache.json file with the changes and sometimes modifies the cache file.
.gitlab-ci.yaml:
image: ubuntu
stages:
  - test
cache:
  key: one-cache
  paths:
    - cache.json
job1:
  stage: test
  script:
    # - touch cache.json
    - cat cache.json
    - python3 modify_json_file.py
    - cat cache.json
The problem is that the cache.json file does not exist at the next job run. I get the error message: cat: cache.json: No such file or directory. I also inserted the touch command once, but this doesn't change anything for the next run without the touch command.
Am I doing something wrong, or do I misunderstand the cache in GitLab?
I think you need artifacts and not cache.
From cache vs artifact:
cache - Use for temporary storage for project dependencies. Not useful for keeping intermediate build results, like jar or apk files. Cache was designed to be used to speed up invocations of subsequent runs of a given job, by keeping things like dependencies (e.g., npm packages, Go vendor packages, etc.) so they don't have to be re-fetched from the public internet. While the cache can be abused to pass intermediate build results between stages, there may be cases where artifacts are a better fit.
artifacts - Use for stage results that will be passed between stages. Artifacts were designed to upload some compiled/generated bits of the build, and they can be fetched by any number of concurrent Runners. They are guaranteed to be available and are there to pass data between jobs. They are also exposed to be downloaded from the UI.
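As a rough illustration of the artifacts route, the sketch below passes cache.json from one job to a later job in the same pipeline. The prepare stage and job2 are hypothetical additions to the question's config, and it assumes python3 is available in the image and that modify_json_file.py creates cache.json when it is missing. Artifacts are only handed to later jobs of the same pipeline, not to the next pipeline run.

```yaml
image: ubuntu

stages:
  - prepare
  - test

job1:
  stage: prepare
  script:
    - python3 modify_json_file.py
  artifacts:
    # Upload the file so that later jobs in this pipeline can download it
    paths:
      - cache.json
    expire_in: 1 day

job2:
  stage: test
  needs:
    # Fetch job1's artifacts before this job's script runs
    - job: job1
      artifacts: true
  script:
    - cat cache.json
```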
