How to list the published container images in the Google Container Registry in a CLI in image size order

Using a CLI, I want to list the images in each repository in a Google Container Registry project but with the following conditions:
Lists the images with the latest tag only
Lists the human-readable size of the images
Lists the name of the images
The closest I've managed to get is through gsutil:
gsutil du -h gs://eu.artifacts.my-registry.appspot.com/containers/images
Resulting in:
33.77 MiB gs://eu.artifacts.my-registry.appspot.com/containers/images/sha256:03c1a2387ef6cb30a7428a46821f946d6a2c591a26cb2066891c55b2b6846ae2
1.27 MiB gs://eu.artifacts.my-registry.appspot.com/containers/images/sha256:03c1e7db6bf0140bd5fa34236a35453cb73cef01f6d89b98bc5995ae8ea07aaf
1.32 KiB gs://eu.artifacts.my-registry.appspot.com/containers/images/sha256:03c3c97495d60c68d37d04a7e6c9b3a48bb159ce5dde13d0d81b4e75e2a3f1d4
81.92 KiB gs://eu.artifacts.my-registry.appspot.com/containers/images/sha256:03c5483cb8ac9c9ae498507e15d68d909a11859a8e5238556b7188e0af4d9264
457.43 KiB gs://eu.artifacts.my-registry.appspot.com/containers/images/sha256:03c7f98faa1cfc05264e743e23ca2e118d24c57bfd67d5cb2e2c7a57e8124b6c
7.88 KiB gs://eu.artifacts.my-registry.appspot.com/containers/images/sha256:03c83b13d044844cd3f6b278382e408541f22029acaf55d9e7e5689b8d51eeea
But obviously this does not meet most of my criteria.
The information is available through the GUI on a per-image basis.
Any ideas?
I'm open to gsutil, gcloud, docker, anything really which can be installed on a docker container.

You can use the Google Cloud UI to accomplish this. There's a column selector right next to the filter bar and it has an option for the image size.
Once the column is displayed, you'll be able to order by size.

It seems, after reading your comment on Jason's answer, that you have only one outstanding issue: listing the container image sizes. It is not possible to retrieve this with a gcloud command directly. Here are two workarounds I tested:
You can use the gcloud container images describe command to see the size of an image. Make sure you use the "--log-http" flag with it. The command should look like this:
$ gcloud container images describe gcr.io/myproject/myimage:tag --log-http
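If you only need a rough total, one option (a sketch, not an official gcloud feature) is to sum the layer "size" fields that show up in the manifest JSON printed by --log-http and format the result with numfmt; it is the same trick the script in a later answer uses, condensed into one pipeline:
$ gcloud container images describe gcr.io/myproject/myimage:tag --log-http 2>&1 \
    | grep -o '"size": *[0-9]*' \
    | grep -o '[0-9][0-9]*' \
    | awk '{s+=$1} END {print s}' \
    | numfmt --to=iec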
Another way to get the size of an image is the gsutil stat command.
So here's what I did:
a. Running the command below, I listed all the images from the GCS bucket and saved them to a file called images.txt:
$ gsutil ls "BUCKET URL" > images.txt
b. I then ran gsutil stat in a loop to read each object name from images.txt and print its size in bytes:
$ for x in $(cat images.txt); do gsutil stat "$x" | grep Content-Length | awk '{print $2}'; done
You can customize this little script according to your need.
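For what it's worth, a small variation of the same idea gets closer to the size-ordered list you asked for. This is only a sketch, assuming the bucket layout from your question and that GNU sort and numfmt are available:
for x in $(gsutil ls gs://eu.artifacts.my-registry.appspot.com/containers/images); do
  size=$(gsutil stat "$x" | grep Content-Length | awk '{print $2}')
  echo "$size $x"
done | sort -rn | numfmt --field=1 --to=iec
Each output line then shows a human-readable size next to the blob it belongs to, largest first.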
I understand these are not efficient workarounds, but they seem to be the only options for now. However, GCR just implements the Docker Registry API, so maybe you can read that documentation to see if you can build something of your own.

Here's a rudimentary script which takes the first tag of each image, sums the sizes of all its layers, and writes the result to a report. It takes ages on a 3 TB registry, but at least I know which repositories are big.
echo "REPO,SIZE" > repository-size-report.csv
for REPO in $(gcloud container images list --repository eu.gcr.io/comerge-comerge01-171833 --format="table[no-heading](NAME)") ; do
for TAGS in $(gcloud container images list-tags $REPO --format="table[no-heading](TAGS)"); do
TAG=$(echo $TAGS | cut -d, -f1)
SUM=0
for SIZE in $(gcloud container images describe $REPO:$TAG --log-http 2>&1 | grep size | grep -o '[0-9][0-9]*') ; do
SUM=$((SUM + SIZE))
done
HSUM=$(echo $SUM | numfmt --to iec --format "%8f")
echo "$REPO:$TAG,$HSUM"
echo "$REPO:$TAG,$HSUM" >> repository-size-report.csv
done
done
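A possible follow-up (not part of the original script): sort the report by size so the biggest repositories come first. This assumes GNU sort, whose -h flag understands the human-readable numbers produced by numfmt:
{ head -n 1 repository-size-report.csv; tail -n +2 repository-size-report.csv | sort -t, -k2 -hr; } > repository-size-report.sorted.csv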

You can use the gcloud container images list command to accomplish this task; however, you will need to set the appropriate flags for your use case. You can read more about the command and its flag options here.
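For example, something along these lines (names are illustrative; double-check the flags against the gcloud reference). Note that neither command exposes an image size column, which is the gap discussed above:
gcloud container images list --repository=eu.gcr.io/my-project
gcloud container images list-tags eu.gcr.io/my-project/my-image --filter="tags:latest"
The first lists the image names in a repository; the second restricts a given image to the digest carrying the latest tag, with the digest, tags and timestamp in the default output.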

Related

Sum sizes in different units

For context, I'm trying to compute the total size in bytes taken by docker images on my machine. I know of docker system df, but I want to understand how I can do this in general.
If I run docker image ls -a, I get something like this:
REPOSITORY TAG IMAGE ID CREATED SIZE
debian 8 00b72214a37e 3 days ago 129MB
debian latest 971452c94376 3 days ago 114MB
Now I'd like to sum the SIZE column, so I can remove the first row with tail +2, and then use awk to sum the 7th column (using this):
docker image ls -a | tail +2 | awk '{s+=$7}END{print s}'
This command will correctly give me the total size in MB (243MB).
However, if an image has its size in GB, awk will add it to the sum but will ignore the unit, so for instance, the same command would return 244MB instead of 1.243GB on the following images:
REPOSITORY TAG IMAGE ID CREATED SIZE
debian 8 00b72214a37e 3 days ago 129MB
debian latest 971452c94376 3 days ago 114MB
debian latest 971452c94376 3 days ago 1GB
How can I tweak my command to have it support sizes (or values in general) with different metric prefixes? I don't necessarily want the output to be formatted in any way, for instance an output in bytes would be fine.
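One hedged way to handle the units directly in awk (a sketch only; it assumes the decimal kB/MB/GB suffixes docker prints, and since the column is rounded the total is approximate):
docker image ls -a | tail -n +2 | awk '
  { v = $7 + 0
    if      ($7 ~ /GB$/) v *= 1e9
    else if ($7 ~ /MB$/) v *= 1e6
    else if ($7 ~ /kB$/) v *= 1e3
    s += v }
  END { print s " bytes (approximate)" }'
Because the SIZE column is rounded, the docker image inspect approach in the answer below stays exact.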
You can do this reliably with a different docker command in bash using this script:
tot=0
while read -r id; do
  (( tot += $(docker image inspect --format='{{.Size}}' "$id") ))
done < <(docker image ls --format='{{.ID}}')
echo "total-size-in-bytes=$tot"
Note that:
docker image inspect --format='{{.Size}}' prints the size of a given image in bytes
docker image ls --format='{{.ID}}' prints all image IDs

Bintray docker repository storage

On Bintray I found out that I have a private docker repository consuming quite a lot of space:
Account usage by repository
I then proceeded to do some housekeeping and kept only the last 3 tags of all the images I have. However, that didn't help much: the storage didn't change at all after deleting all these old tags.
I used this API endpoint: https://bintray.com/docs/api/#_get_package_files to get an estimate of the package file sizes:
for img in $(cat images) ; do curl -s -XGET -u "user:pass" https://bintray.com/api/v1/packages/my-org/internal-docker/$img/files | python -m json.tool | jq '.[] | .size' | awk '{ sum += $1 } END { print sum }' ; done
Summing all those up gets me 63723101568 bytes, about 60GB.
Any idea where the other 310GB are?
Notice that even if the 3 tags were completely different from each other, I would get at worst 3x that figure, so 180GB. But the 375GB is still there.
Where are you getting the array 'images'?
You are not curling for the entire list but rather single files.
It's possible your list did not include all the images in this repo.
Check to make sure you have traversed all subfolders for images as well.
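To rule out missing packages, a sketch along these lines could help. It assumes Bintray's "Get Packages" endpoint (GET /api/v1/repos/:subject/:repo/packages) returns the package names; double-check the API docs before relying on it:
for img in $(curl -s -u "user:pass" "https://bintray.com/api/v1/repos/my-org/internal-docker/packages" | jq -r '.[].name'); do
  curl -s -u "user:pass" "https://bintray.com/api/v1/packages/my-org/internal-docker/$img/files" \
    | jq '[.[].size] | add'
done | awk '{ sum += $1 } END { print sum " bytes" }'
That way the full package list comes from the API itself rather than a hand-maintained file.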
After some time, something changed in the backend storage.
I asked their support (you have to click on Feedback when logged in to Bintray) and they're checking whether any housekeeping was already being done, or whether something was only done after I complained to them.
I'll update if I hear more from them.

export github commits/names to CSV with bash & jq

For a project I need to extract data from a lot of different blockchain GitHub profiles to a csv.
After browsing through the GitHub API I was able to get some of the necessary data as txt/csv files using bash commands and jq.
Doing all of this manually would probably take 7 days. I have a list of profiles I need to loop through, saved as a CSV.
The list looks like this --> https://docs.google.com/spreadsheets/d/1lFsewAYI7F8zSw7WPhI9E9WwR8f4G1clw1yjxY3wz_4/edit#gid=0
My approach so far to get all the repo names looks like this:
sample='[{"name":"0chain"},{"name":"0stateapp"},{"name":"0xcert"}]'
The CSV belongs in here; I didn't know how to read it into that variable yet, but for testing purposes this was enough. If somebody knows how to, feel free to give a hint.
for row in $(echo "${sample}" | jq -r '.[] | @base64'); do
  _jq()
  {
    echo ${row} | base64 --decode | jq -r ${1}
  }
  for GHUSER in $(_jq '.name'); do
    curl -s "https://api.github.com/users/$GHUSER/repos?per_page=100" | jq -r '.[]|.full_name'
  done
done
The output looks like this:
0chain/0chain-token
0chain/client-sdk
0chain/docs
0chain/gorocksdb
0chain/hostadmin
0chain/rocksdb
0stateapp/ZSCoin
0xcert/0xcert
0xcert/conventions
0xcert/docs
0xcert/erc721-validator
0xcert/erc721-validator-api
0xcert/erc721-validator-ui
0xcert/erc721-website
0xcert/ethereum
0xcert/ethereum-crowdsale
0xcert/ethereum-dex
0xcert/ethereum-erc20
0xcert/ethereum-erc721
0xcert/ethereum-minter
0xcert/ethereum-utils
0xcert/ethereum-xcert
0xcert/ethereum-xcert-builder
0xcert/ethereum-zxc
0xcert/framework
0xcert/framework-cert-test
0xcert/nonfungiblealliance-www
0xcert/solidity-style-guide
0xcert/techpaper
0xcert/truffle
0xcert/web3.js
What I need to do is use all of the above values and generate a file that contains:
Github Profile (already stored in the attached sheet)
The Date when accessing this information
All the repositories belonging to that profile (the code above, but filtered)
Now the Interesting part:
The commit history
number of commit (ID)
Date of commit
Description of commit
person who committed
checks passed
checks failed
Almost the same needs to be done for closed and open pull requests, although I think that once the problem above is solved, handling the pull requests follows the same strategy.
For the commits I'd do something like this:
for commits in "${repoarray[@]}"; do curl -s https://api.github.com/repos/$commits/commits | jq -r '.[]|.author.login'; done   # plus whatever else is needed from the response
Basically, this chart here needs to be filled:
https://docs.google.com/spreadsheets/d/1mFXiohiWNXNP8CVztFA1PFF41jn3J9sRUhYALZShsPY/edit?usp=sharing
what I need help with:
storing my output from the first loop in an array (see the sketch after this question)
loop through that array to get the number of commits
loop through that array to get the data to closed pull requests
loop through that array to get the data to open pull requests
Excuse my "noobish" question.
I'm using bash/jq and the GitHub API for the first time.
I'd appreciate any kind of help.
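Not a full answer, but a rough sketch of the shape this could take: collect the repo names into a bash array (reusing $GHUSER from the loop above), then loop over it once for commits and once for pull requests. The endpoint paths are the standard GitHub REST ones; the exact jq fields and the github-data.csv name are placeholders to adapt:
repoarray=()
while read -r repo; do
  repoarray+=("$repo")
done < <(curl -s "https://api.github.com/users/$GHUSER/repos?per_page=100" | jq -r '.[].full_name')

for repo in "${repoarray[@]}"; do
  # one CSV row per commit: repo, sha, date, author, first line of the message
  curl -s "https://api.github.com/repos/$repo/commits?per_page=100" \
    | jq -r --arg repo "$repo" \
        '.[] | [$repo, .sha, .commit.author.date, .commit.author.name, (.commit.message | split("\n")[0])] | @csv'
  # one CSV row per pull request, open and closed alike
  curl -s "https://api.github.com/repos/$repo/pulls?state=all&per_page=100" \
    | jq -r --arg repo "$repo" \
        '.[] | [$repo, .number, .state, .created_at, .user.login] | @csv'
done >> github-data.csv
Pagination beyond per_page=100 and the checks-passed/failed columns would still need extra calls, but the loop structure stays the same.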

s3 awk bash pipeline

Following on from this question: Splitting out a large file.
I would like to pipe objects from an Amazon s3:// bucket containing large gzipped files and process them with an awk command.
Sample file to process
...
{"captureTime": "1534303617.738","ua": "..."}
...
Script to optimize
aws s3 cp s3://path/to/file.gz - \
| gzip -d \
| awk -F'"' '{date=strftime("%Y%m%d%H",$4); print > "splitted."date }'
gzip splitted.*
# make some visual checks here before copying to S3
aws s3 cp splitted.*.gz s3://path/to/splitted/
Do you think I can wrap everything in the same pipeline to avoid writing files locally?
I can use the approach from "Using gzip to compress files to transfer with aws command" to gzip and copy on the fly, but gzipping inside awk would be great.
Thank you.
Took me a bit to understand that your pipeline creates one "splitted.<date>" file for each distinct hour in the source file. Since shell pipelines operate on byte streams and not files, while S3 operates on files (objects), you must turn your byte stream into a set of files on local storage before sending them back to S3. So a pipeline by itself won't suffice.
But I'll ask: what's the larger purpose you're trying to accomplish?
You're on the path to generating lots of S3 objects, one for each hour found in your "large gzipped files". Is this using S3 as a key-value store? Is this the best design for the goal of your effort? In other words, is S3 the best repository for this information, or is there some other store (DynamoDB, or another NoSQL database) that would be a better solution?
All the best
Two possible optimizations:
On large and multiple files it will help to use all the cores to gzip the files; use xargs, pigz or GNU parallel (see Gzip with all cores).
Parallelize the S3 upload:
https://github.com/aws-samples/aws-training-demo/tree/master/course/architecting/s3_parallel_upload
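On the "gzipping inside awk" point: gawk can pipe print output into a command, and that command string can include a redirection, so each hourly group can be compressed on the fly. A sketch, keeping the illustrative paths from the question:
aws s3 cp s3://path/to/file.gz - \
  | gzip -d \
  | awk -F'"' '{
      date = strftime("%Y%m%d%H", $4)
      print $0 | ("gzip > splitted." date ".gz")
    }'
# the compressed parts can then be copied back, e.g.:
# aws s3 cp . s3://path/to/splitted/ --recursive --exclude "*" --include "splitted.*.gz"
Each distinct hour keeps its own gzip process open, so this assumes the number of hours per input file stays reasonable.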

Enhanced docker stats command with total amount of RAM and CPU

I just want to share a small script that I made to enhance the docker stats command.
I am not sure about the accuracy of this method.
Can I assume that the total amount of memory consumed by the complete Docker deployment is the sum of the memory consumed by each container?
Please share your modifications and/or corrections. The docker stats command is documented here: https://docs.docker.com/engine/reference/commandline/stats/
When running docker stats, the output looks like this:
$ docker stats --all --format "table {{.MemPerc}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.Name}}"
MEM % CPU % MEM USAGE / LIMIT NAME
0.50% 1.00% 77.85MiB / 15.57GiB ecstatic_noether
1.50% 3.50% 233.55MiB / 15.57GiB stoic_goodall
0.25% 0.50% 38.92MiB / 15.57GiB drunk_visvesvaraya
My script will add the following line at the end:
2.25% 5.00% 350.32MiB / 15.57GiB TOTAL
docker_stats.sh
#!/bin/bash
# This script is used to complete the output of the docker stats command.
# The docker stats command does not compute the total amount of resources (RAM or CPU)
# Get the total amount of RAM, assumes there are at least 1024*1024 KiB, therefore > 1 GiB
HOST_MEM_TOTAL=$(grep MemTotal /proc/meminfo | awk '{print $2/1024/1024}')
# Get the output of the docker stat command. Will be displayed at the end
# Without modifying the special variable IFS, the output of the docker stats command won't keep
# its newlines, resulting in a failure when using awk to process each line
IFS=;
DOCKER_STATS_CMD=`docker stats --no-stream --format "table {{.MemPerc}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.Name}}"`
SUM_RAM=`echo $DOCKER_STATS_CMD | tail -n +2 | sed "s/%//g" | awk '{s+=$1} END {print s}'`
SUM_CPU=`echo $DOCKER_STATS_CMD | tail -n +2 | sed "s/%//g" | awk '{s+=$2} END {print s}'`
SUM_RAM_QUANTITY=`LC_NUMERIC=C printf %.2f $(echo "$SUM_RAM*$HOST_MEM_TOTAL*0.01" | bc)`
# Output the result
echo $DOCKER_STATS_CMD
echo -e "${SUM_RAM}%\t\t\t${SUM_CPU}%\t\t${SUM_RAM_QUANTITY}GiB / ${HOST_MEM_TOTAL}GiB\tTOTAL"
From the documentation that you have linked above,
The docker stats command returns a live data stream for running containers.
To limit data to one or more specific containers, specify a list of container names or ids separated by a space.
You can specify a stopped container but stopped containers do not return any data.
and then furthermore,
Note: On Linux, the Docker CLI reports memory usage by subtracting page cache usage from the total memory usage.
The API does not perform such a calculation but rather provides the total memory usage and the amount from the page cache so that clients can use the data as needed.
Based on this, it looks like you can assume so, but do not forget that with --all it also factors in containers that exist but are not running.
Your docker_stats.sh does the job for me, thanks!
I had to add unset LC_ALL somewhere before LC_NUMERIC is used, though, as the former overrides the latter; otherwise I get this error:
"Zeile 19: printf: 1.7989: Ungültige Zahl." ("line 19: printf: 1.7989: invalid number") This is probably due to my using a German locale.
There is also a discussion to add this feature to the "docker stats" command itself.
Thanks for sharing the script! I've updated it so that it depends on DOCKER_MEM_TOTAL instead of HOST_MEM_TOTAL, as Docker has its own memory limit which can differ from the host's total memory.
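For reference, a minimal sketch of that variation (assuming docker info exposes MemTotal in bytes, as recent versions do):
DOCKER_MEM_TOTAL=$(docker info --format '{{.MemTotal}}')
DOCKER_MEM_TOTAL_GIB=$(echo "scale=2; $DOCKER_MEM_TOTAL/1024/1024/1024" | bc)
echo "Docker total memory: ${DOCKER_MEM_TOTAL_GIB}GiB"
The rest of the script can then use DOCKER_MEM_TOTAL_GIB in place of HOST_MEM_TOTAL.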
