Reproducible Debian install - apt

Is there a way to create a clean Debian-based image (I want it for a container, but it could also be for a virtual machine) with a custom selection of packages that would be bit-for-bit identical as long as the installed packages and debconf parameters are the same?
There would be basically two uses for this:
An image that specifies the exact versions of the packages it contains could be independently verified (using snapshots, or by rebuilding packages to the extent that Debian has managed to make those builds reproducible).
Easy checking for new package versions: the image could simply be rebuilt nightly, and its checksum would only change once there were actual changes in the packages.
It could be built from a Debian-published base image (e.g. the Docker image debian:stable) using apt, or with debootstrap (IIRC the base Debian image is built with debootstrap as well), or with another suitable builder.

If you would like to guarantee that, build your image once, save it using docker save or docker push it somewhere, and from then on use that image as the base image.
docker save: https://docs.docker.com/engine/reference/commandline/save/
docker push: https://docs.docker.com/engine/reference/commandline/push/
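A minimal sketch of that workflow (the image name and registry are placeholders):
docker build -t registry.example.com/team/base:2020-01 .
docker push registry.example.com/team/base:2020-01
# or, instead of pushing, keep it as a local tarball:
docker save registry.example.com/team/base:2020-01 -o base.tar
docker load -i base.tar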
EDIT: This wouldn't work, see comments below.

You can use mmdebstrap, which is supposed to create reproducible installations by default (if the SOURCE_DATE_EPOCH environment variable is set); if it doesn't, I think that would be considered a bug.
Or you can use debuerreotype.
There's also a wiki page tracking this for other tools in Debian at https://wiki.debian.org/ReproducibleInstalls.
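For illustration, a minimal mmdebstrap invocation along those lines (the suite, package list, and snapshot mirror are example values; pinning the mirror to snapshot.debian.org is what keeps the package versions fixed between rebuilds):
SOURCE_DATE_EPOCH=1704067200 \
  mmdebstrap --variant=minbase --include=ca-certificates,openssh-client \
  bookworm rootfs.tar \
  http://snapshot.debian.org/archive/debian/20240101T000000Z/
The resulting rootfs.tar can be imported as a container image (e.g. with docker import), and its checksum compared across rebuilds.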

Related

Dockerfile vs create image from container

Is there some difference between creating an image using a Dockerfile vs. creating an image from a container? (E.g. run a container from the same base as the Dockerfile, transfer the installers to the container, run them from the command line, and then create an image from the container.)
At least I found out that installing the VC Runtime from the Windows base Docker container does not work :(
If you create an image using a Dockerfile, it's all but trivial to update the image by checking it out from source control, updating the tag on a base image or docker pulling a newer version of it, and re-running docker build.
If you create an image by running docker commit, and you discover in a year that there's a critical security vulnerability in the base Linux distribution and you need to stop using it immediately, you need to remember what it was you did a year ago to build the image and exactly what steps you did to repeat them, and you'd better hope you do them exactly the same way again. Oh, if only you had written down in a text file what base image you started FROM, what files you had to COPY in, and then what commands you need to RUN to set up the application in the image...
In short, writing a Dockerfile, committing it to source control, and running docker build is almost always vastly better practice than running docker commit. You can set up a continuous-integration system to rebuild the image whenever your source code changes; when there is that security vulnerability it's trivial to bump the FROM line to a newer base image and rebuild.
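For instance, a minimal Dockerfile capturing those three pieces (the base image, files, and install commands are placeholders):
FROM debian:stable
COPY app/ /opt/app/
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3 \
 && rm -rf /var/lib/apt/lists/*
CMD ["/opt/app/run.sh"]
Bumping the FROM line and re-running docker build is then all it takes to rebuild on a patched base.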

How to provide singularity images where users can add a custom set of software from a catalogue provided by us

We want to improve the reproducibility of the analyses at our institute. To this end, we are contemplating implementing a system based on Singularity. The idea is that at the beginning of an analysis, the user can choose a machine configuration (later amendments must be possible) that sticks with them until the project is complete. Then the image is archived with the analysis. Ideally, the user doesn't have to issue system-admin commands (install packages, etc.) in the process.
She just makes a request like "I need R with tidyverse and Python 3 and this and that in-house package" and gets a command she can use to ssh into a Singularity container that has those features. When she makes a new request, she gets the newest versions of the programs, but once the container has been deployed those versions don't change anymore.
It gets tricky when I consider that multiple users will need different combinations of software. Do I need to provide an image for every combination of software and software extension packages, even if I only think of a scenario where users can choose an arbitrary combination of {R, Julia, Python, r-tidyverse, r-data.table, r-whatever-genomic-analysis-package-on-bioconductor, python-...}?
Is there a feature-selection mechanism in the vein of
singularity pull library://alpine:3.7 +r:3.2.1 +python3:3.7 +r-package:1.2.3
such that the user can
ssh cluster01 -- singularity shell project-abc.simg
and start/continue working?
If not, is there an alternative approach to supplying custom machine configurations to users using singularity?
I did find Singularity Compose, but this seems to just run multiple containers as services next to each other, so the images can stay separate; I would have to merge them.
Yes, with Singularity, a dedicated image must be provided for each possible combination of packages.
Selecting a set of applications per user is possible by switching your setup to the package managers Nix or GNU Guix (which started as a fork of Nix). The concept here is that each application/library lives in its own directory, whose name contains a hash of the package. Therefore, multiple versions of an application can coexist, and each application can link against its own version of a given library.
A user can select a set of those directories as a user profile: a folder of symlinks pointing into the binaries in the proper application directories (the Nix manual describes this in detail).
So each user can set up their environment as they like, down to bitwise reproducibility.
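As a sketch of what that looks like in practice with Nix (assuming Nix is installed and the nixpkgs channel is available; the package attribute names are examples):
# Install R and Python 3 into the current user's profile;
# ~/.nix-profile becomes a tree of symlinks into /nix/store
nix-env -iA nixpkgs.R nixpkgs.python3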
After the analysis, the profile can be turned into an image. I know it's possible with Guix using guix pack (which can produce tar, Docker, and Singularity images).
For Nix, I'm not sure. There is a project on GitHub, datakurre/nix-build-pack-docker, but it has been dormant since 2015. Maybe it's enough to copy the needed subset of /nix/store into a folder, pull a NixOS image, and bind the /nix/store of that image to your own folder?
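For the Guix route, a minimal sketch (the package names are examples; bash is included so that singularity shell has a shell to run inside the image):
# Build a Singularity/SquashFS image containing R, the tidyverse and Python
guix pack -f squashfs bash r r-tidyverse python
# guix pack prints the path of the resulting image under /gnu/store;
# copy that file next to the analysis and open it with
singularity shell /path/to/copied-image.squashfs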

Obtaining a docker image's parent images

Is there a way to obtain the docker parent image tree for a given image? I know
docker history IMG_NAME will provide an image ID for the current image you're working with, but everything else is missing. I've read this was taken out in v1.10 for security reasons, but not being able to verify the tree of images that a final image was created from seems like the larger concern.
The other thing I've found is docker save IMG_NAME -o TAR_OUTPUT.tar, which will let you view all of the files in each layer, but that seems pretty tedious.
How can I be assured that the only things modified in a given image for a piece of software are the installation and configuration of the software itself? Being able to see the changes in the Dockerfiles used to generate each successive image would be an easy way to verify this.
Apart from what has been said by chintan thakar, you will have to iterate, maybe several times.
An example should clarify this.
Suppose you want to dig into an image, and the Dockerfile to create this image starts with
FROM wordpress
so you go to
https://hub.docker.com/_/wordpress/
have a look at the Dockerfile, and you notice that
https://github.com/docker-library/wordpress/blob/0a5405cca8daf0338cf32dc7be26f4df5405cfb6/php5.6/apache/Dockerfile
starts with
FROM php:5.6-apache
so you go to the PHP 5.6 Dockerfile at
https://github.com/docker-library/php/blob/eadc27f12cfec58e270f8e37cd1b4ae9abcbb4eb/5.6/apache/Dockerfile
and you find the Dockerfile starts with
FROM debian:jessie
so you go to the Dockerfile of Debian jessie at
https://github.com/debuerreotype/docker-debian-artifacts/blob/af5a0043a929e0c87f7610da93bfe599ac40f29b/jessie/Dockerfile
and notice that this image is built like this
FROM scratch
ADD rootfs.tar.xz /
CMD ["bash"]
So you will need to do this if you want to see where all the files come from.
If a security issue is announced, you will also need to do this in order to know whether you are affected.
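Two commands that help with this kind of manual walk, both already mentioned in the question (the image name is an example):
# Show the commands recorded for each layer, untruncated
docker history --no-trunc wordpress
# Export the image so the individual layer filesystems can be inspected
docker save wordpress -o wordpress.tar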

Can I transport just the image changes/layers i'm concerned with?

Say I use a Dockerfile to build an image from ubuntu:14.04, then install some things, add some code, and then push to a repo from which the image will be deployed for testing and eventually production.
My image works out to be > 2 GB. Most of that is the underlying and unchanging ubuntu:14.04 base layer.
Instead of shipping around my bloated image containing the ubuntu:14.04 base layer, theoretically I should be able to ensure my target systems already have that base image and ship around just my higher-level changes, which would be applied on top.
(Of course, if the underlying image changed, I'd have to ensure the latest version was available on the target systems as well.)
Can we do this?
There is a tool called bup which can split images into chunks and store only the differences. Check it out.
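A rough sketch of how that could be used (assuming the tool meant here is bup, https://github.com/bup/bup; the image and branch names are placeholders):
# Export the image to a tarball, then store it in a bup repository;
# bup's rolling-checksum chunking means a later save of a rebuilt image
# stores only the chunks that actually changed
docker save myapp:latest -o myapp.tar
bup init
bup index myapp.tar
bup save -n myapp myapp.tar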

Does dockstore-tool-samtools-index have GATK/BWA/Cromwell already configured?

Just wanted to know if the Docker image named dockstore-tool-samtools-index, which is available at https://quay.io/repository/cancercollaboratory/dockstore-tool-samtools-index
and given as an input to the Google Genomics API (pipelines.create), contains genomics tools such as GATK/BWA or Cromwell.
Any help regarding this will be appreciated.
Thanks.
It does not appear to contain those additional tools: https://github.com/CancerCollaboratory/dockstore-tool-samtools-index/blob/master/Dockerfile
Here's how to check:
Find the docker container on https://dockstore.org/search-containers.
Click on the "GitHub" link in the row with the container of interest.
Read the contents of the Dockerfile to see what the image will contain.
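If you want to verify directly, you can also pull the image and look for the tools on its PATH (the image name comes from the question; the tool names are examples, and this assumes the image provides the which utility):
docker pull quay.io/cancercollaboratory/dockstore-tool-samtools-index
docker run --rm --entrypoint which quay.io/cancercollaboratory/dockstore-tool-samtools-index samtools bwa gatk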
One aspect of Docker images is that they usually try to have only one specific tool installed on them. This way the images are kept as small as possible, with the idea that they can be used like modules.
There are images listed in the Dockstore search link provided by Nicole above which have BWA installed. Cromwell usually launches Docker containers rather than being installed on a Docker image, since it is more of a workflow management system. You are always welcome to create your own custom image with your preferred software packages installed to fit what you need.
Hope it helps,
Paul

Resources