Why are there multiple copies of conda files? - anaconda

I installed Miniconda a while ago, and since then I've noticed there seem to be several copies of the same files (or files with very similar names) in different locations on my computer.
For example, almost the exact same files in my folder "C:/ProgramData/Miniconda/pkgs" are also in the folder "C:/Users/me/.conda/pkgs". I should note that the only other things in the ".conda" folder are an "environments.txt" file and an "envs" folder with a file called "conda_envs_dir_test".
I've also noticed that the folder "C:/ProgramData/Miniconda/Lib/site-packages" also contains files with very similar names.
Anyway, I wanted to ask if all this is necessary, and why? Sorry if this seems like a weird question. I'm still relatively new to programming.

Conda Package Caching
Conda downloads and unpacks packages into a package cache, and then uses hardlinking to install those packages into environments. One can freely delete the files in the package caches, though this undermines Conda's ability to minimize redundancy across environments going forward. The safest way to clear the package cache is to use the command
conda clean -tp
Multiple Package Caches
It should be noted that you appear to have two package caches, a system-level cache at C:/ProgramData/Miniconda/pkgs and a user-level cache at C:/Users/me/.conda/pkgs. This occurs when users install with the "Install for All Users" option. This is typically not recommended for regular end users, but rather more for System Administrators who are managing a multi-user installation. Conda functions perfectly (and arguably with less hassle) without ever needing elevated privileges.
All that to say, you may need to elevate your privileges for the conda clean command to also clear out the system-level cache. Additionally, if you haven't been using it too long, you may consider uninstalling the system-level install and reinstalling at the user level.

Related

Is it safe to delete ~/.conda/pkgs [duplicate]

I ran this command to release disk space on anaconda
$ conda clean --all
However, there are still some big files that remain in pkgs folder in anaconda python.
Is it safe to manually delete all the files in pkgs folder? Any risk of corrupting my anaconda environment? What are some side effects, if any?
I am using anaconda 2018 on windows 10.
Actually, under certain conditions it is an option to have the pkgs subdirs removed. As stated here by Anaconda Community Support "the pkgs directory is only a cache. You can remove it completely if you want to.
However, when creating new environments, it is more efficient to leave whatever packages are in the cache around."
According to the documentation you can use conda clean --packages to remove unused packages in pkgs (which will move them to pkgs/.trash from which you can then safely delete them). While this does not check for packages installed using symlinks back to the package cache, this is not a concern if you don't use such environments or work under Windows. I guess that's why conda clean --packages is included in conda clean --all.
To more aggressively save space you can use conda clean --force-pkgs-dirs to remove all writable package caches (with the same caveat that there could be environments linked to these dirs). If you don't use environments or use Anaconda under Windows, you're probably safe. Personally, I use this option without issues.
Edit Commentary
After reviewing the documentation pointed out in @Robert's answer, I must admit my initial response was overly alarmist and, in parts, blatantly incorrect. My apologies for the misleading response.
Nevertheless, I do believe some of what I raised still has some merit for this thread, and so I am deciding to retain the answer with amendments. In particular, I think it worth emphasizing that deleting the pkgs directory may not actually achieve what OP was hoping for (to save space) and that removing the package cache undermines Conda's redundancy minimization strategy going forward by making it impossible to share already installed packages.
Instead, my final recommendation concurs with what @Robert suggested, namely, use conda clean -p to delete unused packages, but keep the cache (pkgs dir) so that future environments can still leverage hardlinks. One last point to note is that some tools, such as conda-pack, rely on the integrity of the package cache in order to work, so deleting pkgs will prevent their use.
Amended Original Response
No, it is definitely not safe, and in fact the only way you would actually free disk space is if you broke your base env. The issue is that all envs use hardlinks to the pkgs directory, so even if you delete the link located in the pkgs directory, the ones in the envs will still be there and so you won't delete any physical files on the disk. The only real deletion you might do is something that is only referenced by base, i.e., the only copy is in pkgs, hence the potential for breaking base.
Correction: The base env still links packages to other locations, so deleting pkgs will not impact base as I originally concluded.
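To make the hardlink point concrete, here is a minimal Python sketch. It is not Conda code; the two file names are hypothetical stand-ins for a file in pkgs and the same file linked into an env, and it assumes a filesystem that supports hardlinks. It shows that removing one name leaves the data reachable through the other.

```python
import os
import tempfile


def demo_hardlink_deletion():
    """Hardlink a file, delete the original name, and report what survives."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins: one name "in pkgs", one name "in an env".
        cache_copy = os.path.join(tmp, "pkgs_copy.txt")
        env_copy = os.path.join(tmp, "env_copy.txt")

        with open(cache_copy, "w") as fh:
            fh.write("package data")

        # A hardlink is a second directory entry for the same data on disk.
        os.link(cache_copy, env_copy)
        links_before = os.stat(env_copy).st_nlink

        # Deleting the "cache" name only drops one reference.
        os.remove(cache_copy)
        links_after = os.stat(env_copy).st_nlink

        with open(env_copy) as fh:
            surviving = fh.read()

    return links_before, links_after, surviving
```

The link count drops from 2 to 1, but the content is still readable through the remaining name, so no disk space is released until the last link is gone.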
I'd highly recommend looking at this other post on estimating the real disk usage of Conda. You may be overestimating how much space is really being used. For most files in pkgs, there is only one physical copy, so there isn't any additional manual optimization to be done.
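As a rough illustration of why naive size totals overestimate, the sketch below walks a directory tree and counts every hardlinked file only once, keyed by device and inode number. The function names and the directory layout in the demo are my own assumptions for illustration, not something taken from the linked post or from Conda.

```python
import os
import stat
import tempfile


def disk_usage(root):
    """Return (apparent_bytes, unique_bytes) for regular files under root.

    apparent_bytes adds up every path; unique_bytes counts each
    (device, inode) pair once, so hardlinked copies are not double counted.
    """
    apparent = 0
    unique = 0
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files
            apparent += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                unique += st.st_size
    return apparent, unique


def demo_disk_usage():
    """Build a toy pkgs/envs layout with one file hardlinked into two envs."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "pkgs", "pkg-1.0", "lib.bin")
        os.makedirs(os.path.dirname(cached))
        with open(cached, "wb") as fh:
            fh.write(b"\0" * 1000)
        for env in ("a", "b"):
            env_dir = os.path.join(tmp, "envs", env)
            os.makedirs(env_dir)
            os.link(cached, os.path.join(env_dir, "lib.bin"))
        return disk_usage(tmp)
```

In the toy layout the three paths add up to 3000 apparent bytes, while only 1000 bytes are actually stored.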

What's the purpose of the "base" (for best practices) in Anaconda?

It says it's a default environment but "You don't want to put programs into your base environment, though"
So what exactly should I use it for? Do other environments I create inherit from the base?
The base environment is where conda itself gets installed. It's best to use Miniconda, and install all the things you want into separate environments.
Other environments do not inherit packages from the base environment. BUT the bin/ directory of the base environment is in the search path for executables. So if you call conda from inside any of your environments (which usually don't have conda installed), the one from the base environment is used.
If you install other executables into the base environment, they can be called from your other environments. But you'll have a hell of a tough time distinguishing whether the things you can call are actually in your environment, or in the base environment.
Therefore, it's best to just have conda in the base environment. And maybe other generic tools, like git or make, if you install that kind of tool with conda. But packages that are imported by your Python/R/whatever code do not belong in the base environment.
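A small sketch of that search-path behaviour, using Python's shutil.which on a made-up directory layout. The directory names and the tool name "sometool" are hypothetical, and the demo assumes a POSIX system where the executable bit decides what counts as a program (on Windows, file extensions matter instead).

```python
import os
import shutil
import tempfile


def resolve(name, search_dirs):
    """Return the first executable called name found in search_dirs, or None."""
    return shutil.which(name, path=os.pathsep.join(search_dirs))


def demo_resolution():
    """Put a tool only in the base bin/ and look it up with the env bin/ first."""
    with tempfile.TemporaryDirectory() as tmp:
        env_bin = os.path.join(tmp, "envs", "myenv", "bin")
        base_bin = os.path.join(tmp, "base", "bin")
        os.makedirs(env_bin)
        os.makedirs(base_bin)

        # The tool exists only in the base environment.
        tool = os.path.join(base_bin, "sometool")
        with open(tool, "w") as fh:
            fh.write("#!/bin/sh\n")
        os.chmod(tool, 0o755)

        # The activated env comes first on the search path, base comes after.
        found = resolve("sometool", [env_bin, base_bin])
        if found is None:
            return "not found"
        return "base" if os.path.dirname(found) == base_bin else "env"
```

The lookup succeeds even though the env itself does not contain the tool, which is exactly what makes it hard to tell where a command really comes from. On a real system, calling shutil.which("sometool") with no path argument and looking at the returned directory answers the same question.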
Don't worry about disk space if you create multiple environments with the same packages. conda does a very good job with hard-linking the same packages into multiple environments to save space.
The full Anaconda installer puts a ton of stuff into the base environment. That might seem convenient at first, but when you start creating new environments, you'll run into the problem I mentioned. You can call stuff from your new environment although it isn't installed there. Using Miniconda avoids this, at the cost of having to create a new environment before actually being able to use stuff. However, there's an anaconda meta-package which you can install to get the "ton of stuff" with one command.

Install package in separate area for read-only Anaconda Linux install

At work we have a central, read-only, Linux Anaconda installation, and several projects need library packages for their individual project members.
Is there a way to conda install packages in a writable area set aside for each project?
Our Linux servers are also not directly web connected, but we can transfer data from a Windows machine that is. Is there a way for the Windows conda to download data for our Linux install in such a way that I can transfer the downloaded files to Linux and then finish the install on Linux, with the Linux conda not needing a direct web connection?
Thanks in advance :-)
The best answer to this question is a bit oblique: the Anaconda Distribution is designed for a single user on a single system with unrestricted access to the Internet. Any other use is considered "off label" and YMMV, though there are no license restrictions in place preventing you from trying to use it as you see fit. Anaconda Enterprise is the commercial product that is specifically designed for multi-user, server-deployed Anaconda with firewall restrictions. Security, governance, indemnification, support, collaboration, etc. etc. Check out https://www.continuum.io/ for more details.
But there are "work around" ways to achieve what you want, albeit complicated ones. For it to be reliable, reproducible, and maintainable you're going to end up reimplementing a lot of what is in Anaconda Enterprise. Here are some tips:
Check out the "conda in multi-user environments" documentation
Check out the "Centralized Anaconda installation" documentation
Regular user alice for project foo can do conda create -p /nfs/project/foo/envs/custompython --offline anaconda; conda activate /nfs/project/foo/envs/custompython; conda install pkg1 pkg2 pkg3
You're going to run into ownership/permission issues. If you have sensible umask values then when alice's colleague bob tries to update pkg2 in the foo project he'll discover that he can't unlink the files alice wrote there. There is stuff you can do (as the IT admin) with chown, or alice can do with chmod, but it's all a bit of a bother and there are lots of ways you can paralyze a conda environment because it is expecting "writability" to be binary for a particular environment. There is a long history in the conda GH issue tracker of people (myself included) shooting themselves in the foot by starting a conda env setup with one account and then making mods with another account that bork out halfway through, leaving everything inconsistent.
Be careful about .condarc files. My advice: avoid them everywhere but in the base Anaconda installation (say, inside /opt/anaconda/.condarc). All sorts of weird stuff can happen when multiple overlaying .condarc files come together (the docs referenced above discuss this).
People can create their own environments in an "offline" mode so long as the packages specified in those new environments (and their dependencies) are a subset of the packages available in the base environment (or subsequently added to the package cache), taking into account versions as well, of course.
You can download packages using your online Windows machine by grabbing them from repo.continuum.io and from anaconda.org. Make sure you download them for the right platform. But the challenge: you need to download a set of packages that will satisfy the dependencies of the package you want to install. There isn't a super easy way to get that information when you're offline.
Once you drop new packages into the Linux system's package cache be sure to re-run conda index.
Beware installing packages directly from their tarballs: this will not pick up any dependencies and does what is called a "force" install. So doing conda install /path/to/conda/pkg-ver.tar.bz2 is actually most similar to doing conda install --force --no-deps pkg=ver (though not identical, to be sure). --force means the install will happen NO MATTER WHAT, even if it will break your environment (violate existing package dependencies), and --no-deps means you won't get any of the dependencies of pkg installed.
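Since a tarball install skips dependency handling, it can help to check what a downloaded package expects before copying it over. The sketch below reads the "depends" list from the info/index.json metadata inside a .tar.bz2 conda package. The function names are my own, the demo builds a fake package purely for illustration, and the result covers only direct dependencies, not the full transitive set.

```python
import io
import json
import os
import tarfile
import tempfile


def read_depends(pkg_path):
    """Return the direct dependencies recorded in a .tar.bz2 conda package."""
    with tarfile.open(pkg_path, "r:bz2") as tar:
        member = tar.extractfile("info/index.json")
        index = json.load(member)
    return index.get("depends", [])


def demo_read_depends():
    """Build a tiny fake package (illustration only) and read it back."""
    index = {"name": "pkg", "version": "1.0",
             "depends": ["python >=3.8", "numpy"]}
    payload = json.dumps(index).encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        pkg_path = os.path.join(tmp, "pkg-1.0-0.tar.bz2")
        with tarfile.open(pkg_path, "w:bz2") as tar:
            info = tarfile.TarInfo("info/index.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        return read_depends(pkg_path)
```

Running read_depends over every downloaded file on the Windows side gives a list of what still has to be fetched; each of those packages needs the same check in turn.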

How to install a Chocolatey package completely offline?

I need to install software on Windows clients that are completely offline. That means they have no Internet access.
An example. Let's say I want to install Paint.Net. I go to a reference machine (with INet) and install Paint.Net with Chocolatey.
choco install paint.net -y
After the install is finished I have the software installed and two artifacts:
The package file "paint.net.nupkg" in %ChocolateyInstall%/lib/paint.net
and
the installer file "paint.net.4.0.6.install.zip" in %Temp%\chocolatey.
I now put these two files on a USB stick. Then I go to the offline machine, plug in the USB stick and want to install the package.
Is it possible to install the software without modifying the package? I am aware that inside the nupkg file there is a tools/chocolateyInstall.ps1 file with a $url variable defined. But I want to install the package without changing the package content or modifying the URL by hand.
I played around with the parameters --cache and --source but with little to no luck.
I have seen that this kind of question has been asked before. But never (to my knowledge) with the intent to run the installer file from the stick too (and not only the package file). So I hope this is not a duplicate.
Caching Downloads - Not Deterministic
While there are ways to set the original nupkg (with the version on it, not the one in the packages directory - use download from left side of package's page on the Chocolatey community package repository) and the cache onto a USB stick somewhere, it's not always deterministic that it will work. You can also override the cache location, so that the folder is somewhere not in TEMP. See choco config, choco config -h and choco config set cacheLocation c:\some\location to do this.
Create Your Own Packages - Better
For packages you need offline, you have the ability to manage your own packages and you can embed software right into the package. This is desired when you want to manage software offline as most things on the community repository are subject to copyright law and distribution rights (why they don't simply have the software they represent embedded).
Creating and working with your own packages is very secure, reliable, and repeatable (and can be completely offline), but it does tend to take up time. If you are doing this for yourself, then it could override any time-savings you get as a consumer using Chocolatey and the community repository.
Internalized Packages - Best
The best thing you can do here is a process called internalizing, where you download and extract the package, download all of the resources and embed them in the package (or put them somewhere local/UNC share), edit the scripts to use those embedded/local resources and recompile the package.
This allows you to take advantage of existing package logic without the issue of the internet.
For more details see Recompiling Packages and Package Internalizer - Automatically Recompile Packages.
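As a first step toward internalizing by hand, it helps to see which remote resources a package would download. A .nupkg file is a zip archive, so the following Python sketch opens it and lists URL assignments found in the install script. It assumes the script uses variables whose names start with $url, as in the question; real packages vary, the function names are hypothetical, and this only locates the URLs without rewriting anything.

```python
import io
import re
import zipfile

# Matches lines such as:  $url = 'https://...'   or   $url64 = "https://..."
URL_ASSIGNMENT = re.compile(
    r"^\s*\$(url\w*)\s*=\s*['\"]([^'\"]+)['\"]",
    re.IGNORECASE | re.MULTILINE,
)


def find_download_urls(nupkg):
    """Map URL variable names to values found in chocolateyInstall.ps1.

    nupkg may be a file path or a binary file-like object.
    """
    urls = {}
    with zipfile.ZipFile(nupkg) as archive:
        for name in archive.namelist():
            if name.lower().endswith("chocolateyinstall.ps1"):
                script = archive.read(name).decode("utf-8-sig", errors="replace")
                for variable, url in URL_ASSIGNMENT.findall(script):
                    urls[variable] = url
    return urls


def demo_find_download_urls():
    """Build an in-memory fake package (illustration only) and scan it."""
    script = "$packageName = 'demo'\n$url = 'https://example.invalid/demo.zip'\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("tools/chocolateyInstall.ps1", script)
    buffer.seek(0)
    return find_download_urls(buffer)
```

Each URL found this way is a file that has to be downloaded on the online machine and then referenced from a local path or UNC share in the edited script.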
NOTE: As a side note, we are thinking of offering the ability to auto recompile with Chocolatey Pro edition and not just the Business edition.
Organization Use of Chocolatey
Most organizations using Chocolatey are doing some combination of creating packages and recompiling packages, because they need absolute trust and control over those packages when being used in production scenarios.

Chocolatey installed package not on Path

I've been installing programs with chocolatey, but it's not adding them to my path automatically. Does anyone know a solution? I just followed the install instruction on Chocolatey's front page, and everything works well. The programs just aren't being added to the path.
It depends on what you install, and whether those native installers add themselves to the path in some cases.
If the package maintainer doesn't take the extra step in the cases where the installer doesn't add a program folder to the PATH, then those items may not be available on the command line.
The other side of this is that those items may be in PATH, but not visible to your current shell (cmd/powershell/whatever) until you open a new one. This is due to how Windows works versus terminal in *nix. We've made some improvements there but it's not perfect. Expect things to get better over time in that aspect.
We have one issue out for ensuring that we create the User PATH correctly in the registry. This might be what is causing the issue for any items that may be adding themselves to this PATH instead of the system PATH.

Resources