We are using Nexus OSS 3.13 as a private Docker registry.
During development, misconfiguration can make some images/layers extremely big.
Currently we have a Nexus Groovy script which generates a report of the biggest files (i.e. layers), but there is no way to find out the corresponding images.
For production this is a show-stopper: we cannot delete the images that use the big layers, because we do not know which images are affected.
We are surprised that such basic functionality is not provided.
Did we miss something in the documentation?
How are others tackling this problem?
Does someone have a good approach/workaround (maybe a Groovy script) to match Docker layers to Docker images in order to solve this issue?
On a host running Docker, you can copy the non-truncated ID (SHA-256) of the layer and grep for it in the folder /var/lib/docker/image.
This will find a file that has a SourceRepository JSON field:
/var/lib/docker/image# find . -name '*aae63f31dee9107165b24afa0a5e9ef9c9fbd079ff8a2bdd966f8c5d8736cc98*'
./overlay2/distribution/v2metadata-by-diffid/sha256/aae63f31dee9107165b24afa0a5e9ef9c9fbd079ff8a2bdd966f8c5d8736cc98
Then when we cat that file, we can see the SourceRepository field I referred to above:
/var/lib/docker/image# cat ./overlay2/distribution/v2metadata-by-diffid/sha256/aae63f31dee9107165b24afa0a5e9ef9c9fbd079ff8a2bdd966f8c5d8736cc98
[{"Digest":"sha256:9931fdda3586a52049081bc78fa9793476662310356127cc8baa52e38bb34a8d","SourceRepository":"docker.io/library/mysql","HMAC":""}]
In the output above we can see that the source repository is docker.io/library/mysql; I picked one of its layers at random.
At the moment I don't believe there is a built-in way to accomplish this; it may be worth submitting a feature request.
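If you want to do the lookup on the registry side instead (so you don't need a Docker host that has already pulled the image), one workaround is to walk the standard Docker Registry v2 API that the Nexus Docker connector exposes. Below is a rough, untested sketch using curl and jq; the registry URL and credentials are placeholders, and the layer digest passed as the first argument is the sha256 reported by your Nexus script.

#!/usr/bin/env bash
# Usage: ./find-images-for-layer.sh sha256:<layer digest from the Nexus report>
# REGISTRY and CREDS are placeholders for your Nexus Docker connector and credentials.
REGISTRY="https://nexus.example.com:8083"
CREDS="user:password"
LAYER="$1"

for repo in $(curl -s -u "$CREDS" "$REGISTRY/v2/_catalog?n=1000" | jq -r '.repositories[]'); do
  for tag in $(curl -s -u "$CREDS" "$REGISTRY/v2/$repo/tags/list" | jq -r '.tags[]?'); do
    # Fetch the image manifest and check whether it references the layer digest
    if curl -s -u "$CREDS" \
         -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
         "$REGISTRY/v2/$repo/manifests/$tag" \
       | jq -e --arg d "$LAYER" '.layers[]? | select(.digest == $d)' >/dev/null; then
      echo "$repo:$tag uses layer $LAYER"
    fi
  done
done

The same loop could be ported to a Nexus Groovy script, but going through the registry API keeps it independent of Nexus internals.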
I have an Azure blob container with data that I did not upload myself. The data is not on my computer locally.
Is it possible to use dvc to download the data to my computer when I haven’t uploaded the data with dvc? Is it possible with dvc import-url?
I have tried using dvc pull, but I can only get it to work if I already have the data locally on the computer and have used dvc add and dvc push.
And if I do it that way, then the folders on Azure are not human-readable. Is it possible to upload them in a human-readable format?
If it is not possible is there then another way to download data automatically from azure?
I'll build on @Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".
Assumptions
Let's assume the following:
We're using a DVC pipeline, specified in an existing dvc.yaml file. The first stage in the current pipeline is called prepare.
Our data is stored on some Azure blob storage container, in a folder named dataset/. This folder follows a structure of sub-folders that we'd like to keep intact.
The Azure blob storage container has been configured in our DVC environment as a DVC 'data remote', with name myazure (more info about DVC 'data remotes' here)
High-level idea
One possibility is to start the DVC pipeline by synchronizing a local dataset/ folder with the dataset/ folder on the remote container.
This can be achieved with a command-line tool called azcopy, which is available for Windows, Linux and macOS.
As recommended here, it is a good idea to add azcopy to your account or system path, so that you can call this application from any directory on your system.
The high-level idea is:
Add an initial update_dataset stage to the DVC pipeline that checks if changes have been made in the remote dataset/ directory (i.e., file additions, modifications or removals).
If changes are detected, the update_dataset stage shall use the azcopy sync [src] [dst] command to apply the changes on the Azure blob storage container (the [src]) to the local dataset/ folder (the [dst]).
Add a dependency between update_dataset and the subsequent DVC pipeline stage prepare, using a 'dummy' file. This file should be added to (a) the outputs of the update_dataset stage; and (b) the dependencies of the prepare stage.
Implementation
This procedure has been tested on Windows 10.
Add a simple update_dataset stage to the DVC pipeline by running:
$ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"
Notice how we specify the 'dummy' file .dataset_updated as an output of the stage.
Edit the dvc.yaml file directly to modify the command of the update_dataset stage. After the modifications, the command shall (a) create the .dataset_updated file after the azcopy command - touch .dataset_updated - and (b) write the current date and time into the .dataset_updated file to guarantee uniqueness between different update events - echo %date%-%time% > .dataset_updated.
stages:
  update_dataset:
    cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
    deps:
    - remote://myazure/dataset/
    outs:
    - .dataset_updated
  ...
I recommend editing the dvc.yaml file directly to modify the command, as I wasn't able to come up with a complete dvc stage add command that took care of everything in one go.
This is due to the use of multiple commands chained by &&, special characters in the Azure connection string, and the echo expression that needs to be evaluated dynamically.
To make the prepare stage depend on the .dataset_updated file, edit the dvc.yaml file directly to add the new dependency, e.g.:
stages:
  prepare:
    cmd: <some command>
    deps:
    - .dataset_updated # add new dependency here
    - ... # all other dependencies
  ...
Finally, you can test different scenarios on your remote side - e.g., adding, modifying or deleting files - and check what happens when you run the DVC pipeline up till the prepare stage:
$ dvc repro prepare
Notes
The solution presented above is very similar to the example given in DVC's external dependencies documentation.
Instead of the az copy command, it uses azcopy sync.
The advantage of azcopy sync is that it only applies the differences between your local and remote folders, instead of 'blindly' downloading everything from the remote side when differences are detected.
This example relies on a full connection string with an SAS token, but you can probably do without it if you configure azcopy with your credentials or fetch the appropriate values from environment variables (a sketch follows after these notes).
When defining the DVC pipeline stage, I've intentionally left out an output dependency with the local dataset/ folder - i.e. the -o dataset part - as it was causing the azcopy command to fail. I think this is because DVC automatically clears the folders specified as output dependencies when you reproduce a stage.
When defining the azcopy command, I've included the --delete-destination="true" option. This allows synchronization of deleted files, i.e. files are deleted on your local dataset folder if deleted on the Azure container.
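As a sketch of the credentials route mentioned in the notes above (untested, and it assumes your Azure account has an appropriate data-access role on the container), you can authenticate once and then drop the SAS token from the URL:

# Interactive Azure AD login (a service principal via azcopy's environment variables also works)
azcopy login
# Same sync as before, but without the SAS token appended to the URL
azcopy sync "https://[account].blob.core.windows.net/[container]/dataset" "dataset/" --delete-destination="true"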
Please bear with me, since you have a lot of questions - the answer needs a bit of structure and background to be useful. Or skip to the very end to find some new ways of handling "Is it possible to upload them in a human-readable format?" :). Anyway, please let me know if that solves your problem, and in general it would be great to have a better, high-level description of what you are trying to accomplish.
You are right that by default DVC structures its remote in a content-addressable way (which makes it non-human-readable). There are pros and cons to this: it's easy to deduplicate data, it's easy to enforce immutability and make sure that no one can touch the storage directly and remove something, directory names in the project stay connected to the actual project and keep their meaning, etc.
Some materials on this: Versioning Data and Models, my answer on how DVC structures its data, and the upcoming Data Management User Guide section (still WIP).
That said, it's clear there are downsides to this approach, especially when it comes to managing a lot of objects in the cloud (e.g. millions of images). To name a few concerns that I see a lot as a pattern:
Data has been created (and is being updated) by someone else - some ETL, a third-party tool, etc. We need to keep that format.
A third-party tool expects the data in a "human-readable" layout. It doesn't integrate with DVC and can't access the data indirectly via Git (one example: Label Studio needs direct links to S3).
It's not practical to move all of the data into DVC, and it doesn't make sense to instantiate all the files at once as one directory. Users need slices, usually based on some annotations (metadata), etc.
So, DVC has multiple features to deal with data in its own original layout:
dvc import-url - it downloads the objects, caches them, and by default pushes them (dvc push) to your remote again to guarantee reproducibility (this can be changed). This command creates a special .dvc file that is used to detect changes in the cloud and decide whether DVC needs to download something again. It should cover the case of "downloading data automatically from Azure". (Example commands are at the end of this answer.)
dvc get-url - this is more or less wget, rclone, aws s3 cp, etc. with multi-cloud support. It just downloads the objects.
A slightly more advanced option (if you use DVC pipelines):
Similar to import-url, but for DVC pipelines - external dependencies.
The third (new) option is in a beta phase. It's called "cloud versioning", and essentially it tries to keep the storage human-readable while still benefiting from .dvc files in Git if you need them to reference an exact version of the data.
Cloud Versioning with DVC (it's WIP as I write this; if the PR is merged, you can find it in the docs).
The document summarizes the approach well:
DVC supports the use of cloud object versioning for cases where users prefer to retain their original filenames and directory hierarchy in remote storage, in exchange for losing the de-duplication and performance benefits of content-addressable storage. When cloud versioning is enabled, DVC will store files in the remote according to their original directory location and
filenames. Different versions of a file will then be stored as separate versions of the corresponding object in cloud storage.
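For reference, here are example invocations of the options above as a hedged sketch: the paths reuse the myazure remote and dataset/ folder from the earlier answer, and the version_aware setting is how cloud versioning is enabled at the time of writing, so double-check the current docs.

# Download and track a folder maintained by someone else; the generated .dvc file
# records the source, so the data can be refreshed later with `dvc update`
dvc import-url remote://myazure/dataset/ dataset

# One-off download; nothing is tracked
dvc get-url remote://myazure/dataset/ dataset

# Opt in to cloud versioning so the remote keeps the original names and hierarchy
dvc remote modify myazure version_aware true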
I want to do some configurations for Google Cloud Ops-Agent in order to deploy it via Ansible.
For example /etc/google-cloud-ops-agent/kafka.yaml
How do I include *.yaml configs?
If I use /etc/google-cloud-ops-agent/config.yaml, I'm worried that the configuration will be overwritten.
There are two ways I can think of to do this.
The easiest (and least precise): use the copy module to recursively copy the directory content to the target. Of course, if there are files other than .yaml, you'll get those as well.
The more complex way (and I have not tested this): use the find module, executed locally on the control node, to get a list of the .yaml files, register their locations, and then copy them up. There's probably a simpler way.
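As a hedged sketch of the first option (the host group, local path and file attributes below are placeholders; the same module and parameters go into a playbook task), a src path that ends in / makes the copy module push the directory contents recursively:

# Recursively copy every file in a local config directory to the target hosts
ansible ops_agent_hosts -m ansible.builtin.copy \
  -a "src=files/google-cloud-ops-agent/ dest=/etc/google-cloud-ops-agent/ owner=root group=root mode=0644"

In a playbook, looping the same copy task with with_fileglob over files/google-cloud-ops-agent/*.yaml would restrict it to the .yaml files only.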
I have an application that uses about 20GB of raw data. The raw data consists of binaries.
The files rarely - if ever - change. Changes only happen if there are errors within the files that need to be resolved.
The simplest way to handle this would be to put the files in their own git repository and create a base image based on that, then build the application on top of the raw-data image.
Having a 20GB base image for a CI pipeline is not something I have tried and does not seem to be the optimal way to handle this situation.
The main reason for my approach here is to prevent extra deployment complexity.
Is there a best practice, "correct" or more sensible way to do this?
Huge mostly-static data blocks like this are, to me, probably the one big exception to the “Docker images should be self-contained” rule. I’d suggest keeping this data somewhere else and downloading it separately from the core docker run workflow.
I have had trouble in the past with multi-gigabyte images. Operations like docker push and docker pull in particular are prone to hanging up on the second gigabyte of individual layers. If, as you say, this static content changes rarely, there’s also a question of where to put it in the linear sequence of layers. It’s tempting to write something like
FROM ubuntu:18.04
ADD really-big-content.tar.gz /data
...
But even the ubuntu:18.04 image changes regularly (it gets security updates fairly frequently; your CI pipeline should explicitly docker pull it) and when it does a new build will have to transfer this entire unchanged 20 GB block again.
Instead I would put them somewhere like an AWS S3 bucket or similar object storage. (This is a poor match for source-control systems, which (a) want to keep old content forever and (b) tend to be optimized for text rather than binary files.) Then I’d have a script that runs on the host to download that content, and mount the corresponding host directory into the containers that need it.
curl -LO http://downloads.example.com/really-big-content.tar.gz
tar xzf really-big-content.tar.gz
docker run -v $PWD/really-big-content:/data ...
(In Kubernetes or another distributed world, I’d probably need to write a dedicated Job to download the content into a Persistent Volume and run that as part of my cluster bring-up. You could do the same thing in plain Docker to download the content into a named volume.)
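A rough sketch of the plain-Docker variant (the image and application names are placeholders, and the download URL reuses the example above): pre-populate a named volume once with a throwaway container, then mount it read-only wherever it's needed.

# Create the volume once and fill it with the static content
docker volume create big-data
docker run --rm -v big-data:/data -w /data alpine:3 sh -c \
  "wget http://downloads.example.com/really-big-content.tar.gz && tar xzf really-big-content.tar.gz && rm really-big-content.tar.gz"
# Mount it read-only in the containers that need it
docker run -v big-data:/data:ro my-app ...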
I'm not too familiar with Kafka, but I would like to know the best way to read data in batches from Kafka so I can use the Elasticsearch Bulk API to load the data faster and reliably.
By the way, I am using Vert.x for my Kafka consumer.
Thank you,
I cannot tell if this is the best approach or not, but when I started looking for similar functionality I could not find any readily available frameworks. I found this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0
and started contributing to it as it was not doing everything I wanted, and was also not easily scalable. Now the 2.0 version is quite reliable and we use it in production in our company processing/indexing 300M+ events per day.
This is not a self-promotion :) - just sharing how we do the same type of work. There might be other options right now as well, of course.
https://github.com/confluentinc/kafka-connect-elasticsearch
Or you can try this source:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer
Running as a standard Jar
1. Download the code into a $INDEXER_HOME dir.
2. cp $INDEXER_HOME/src/main/resources/kafka-es-indexer.properties.template /your/absolute/path/kafka-es-indexer.properties and update all relevant properties as explained in the comments.
3. cp $INDEXER_HOME/src/main/resources/logback.xml.template /your/absolute/path/logback.xml
- specify the directory you want to store logs in
- adjust the max sizes and number of log files as needed
4. Build the app jar (make sure you have Maven installed):
cd $INDEXER_HOME
mvn clean package
The kafka-es-indexer-2.0.jar will be created in $INDEXER_HOME/bin. All dependencies will be placed into $INDEXER_HOME/bin/lib. All JAR dependencies are linked via the kafka-es-indexer-2.0.jar manifest.
5. Edit your $INDEXER_HOME/run_indexer.sh script:
- make it executable if needed (chmod a+x $INDEXER_HOME/run_indexer.sh)
- update properties marked with "CHANGE FOR YOUR ENV" comments according to your environment
6. Run the app [use JDK 1.8]:
./run_indexer.sh
I used Spark Streaming, and it was quite a simple implementation using Scala.
After following this simple tutorial http://www.louisaslett.com/RStudio_AMI/ and video guide http://www.louisaslett.com/RStudio_AMI/video_guide.html I have setup an RStudio environment on EC2.
The only problem is, I can't upload large files (> 1GB).
I can upload small files just fine.
When I try to upload a file via RStudio, it gives me the following error:
Unexpected empty response from server
Does anyone know how I can upload these large files for use in RStudio? This is the whole reason I am using EC2 in the first place (to work with big data).
OK, so I had the same problem myself and it was incredibly frustrating, but eventually I realised what was going on. The default home directory size for AWS is less than 8-10 GB regardless of the size of your instance. Since the upload was going to the home directory, there was not enough room. An experienced Linux user would not have fallen into this trap, but hopefully other Windows users new to this who come across the problem will see this. If you upload to a different drive on the instance, the problem is solved. Because the Louis Aslett RStudio AMI lives in this 8-10 GB space, you will have to set your working directory outside it, i.e. outside the home directory - which is not intuitively apparent from the RStudio Server interface. While this is an advanced forum and this is a rookie error, I hope no one deletes this question, as I spent months on this and I think someone else will too. I hope this makes sense.
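A small sketch of what that looks like from an SSH session on the instance (the mount point /bigdata is a placeholder, and it assumes you have attached and mounted a volume with enough space):

# See which mounted volume actually has free space
df -h
# Create a working directory on the large volume and make it writable by your user
sudo mkdir -p /bigdata
sudo chown $(whoami) /bigdata

Then work from that directory in RStudio (e.g. setwd("/bigdata")) and upload into it, so large files don't land in the small home directory.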
Don't you have shell access to your Amazon server? Don't rely on RStudio's upload (which may reasonably have a 2 GB limit) and use proper Unix dev tools:
rsync -avz myHugeFile.dat amazonusername@my.amazon.host.ip:
Run this on your local PC command line (install Cygwin or another Unix-compatibility system); it will transfer your huge file to your Amazon server, will resume from that point if interrupted, and will compress the data for transfer too.
For a Windows GUI for something like this, WinSCP is what we used in the bad old days before Linux.
This could have something to do with your web server. Are you using nginx or Apache as your web server? If so, you can modify the upload limit. If you are running nginx on the front end of the web server, I would recommend the following fix in your nginx.conf file:
http {
    ...
    client_max_body_size 100M;
}
https://www.tecmint.com/limit-file-upload-size-in-nginx/
I had a similar problem with a 5 GB file. What worked for me was to use SQLite to create a database from the CSV file that I needed: use SQLite to create the database, then use a function in RStudio to communicate with the local database. That way, I was able to bring in the CSV file. I can track down the R code that I used if you like.
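For what it's worth, a minimal sketch of the SQLite route from a shell on the instance (the file, database and table names are placeholders):

# Build a SQLite database from the big CSV without loading it into R
sqlite3 mydata.db <<'EOF'
.mode csv
.import /path/to/huge_file.csv mytable
EOF

From RStudio you can then query the table in chunks through DBI/RSQLite instead of reading the whole CSV into memory.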