Automate creation of a pyenv through sh scripts - pip

I'm running a project that uses pip and a requirements.txt file to install and keep track of some dependencies. I want to write some sh scripts to run, build and test the application. For starters, I would like a way to check whether the current folder already has a virtual environment and, if not, create one to enclose the application and not mess with other people's dependencies. I would also like an opinion on the best way to keep track of these dependencies: whether requirements.txt is a good approach, and whether there's a way to keep track of installed versions, just as happens with Node packages.
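A minimal sketch of the kind of bootstrap script described above, assuming python3 with the built-in venv module (the .venv folder name is just an example, not anything prescribed by the question):
#!/usr/bin/env sh
# Create the virtual environment next to the code if it does not exist yet,
# then install the pinned dependencies into it.
if [ ! -d ".venv" ]; then
    python3 -m venv .venv
fi
. .venv/bin/activate
pip install -r requirements.txt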

Use Pipenv. It's a better way of tracking your dependencies than requirements.txt, and it uses pyenv to automatically install your project's required Python version. A short workflow sketch follows the quoted feature list below.
From the website:
The problems that Pipenv seeks to solve are multi-faceted:
You no longer need to use pip and virtualenv separately. They work together.
Managing a requirements.txt file can be problematic, so Pipenv uses Pipfile and Pipfile.lock to separate abstract dependency declarations from the last tested combination.
Hashes are used everywhere, always. Security. Automatically expose security vulnerabilities.
Strongly encourage the use of the latest versions of dependencies to minimize security risks arising from outdated components.
Give you insight into your dependency graph (e.g. $ pipenv graph).
Streamline development workflow by loading .env files.
[...]
Pipenv Features
Enables truly deterministic builds, while easily specifying only what you want.
Generates and checks file hashes for locked dependencies.
Automatically install required Pythons, if pyenv is available.
Automatically finds your project home, recursively, by looking for a Pipfile.
Automatically generates a Pipfile, if one doesn’t exist.
Automatically creates a virtualenv in a standard location.
Automatically adds/removes packages to a Pipfile when they are installed or uninstalled.
Automatically loads .env files, if they exist.
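A rough workflow sketch, assuming pipenv itself is already installed and the project already has a requirements.txt (my-project and app.py are placeholder names):
cd my-project
pipenv install -r requirements.txt   # imports requirements.txt into a Pipfile and creates the virtualenv
pipenv lock                          # (re)writes Pipfile.lock with exact versions and hashes
pipenv run python app.py             # runs a command inside the virtualenv without activating it manually
pipenv graph                         # inspects the resolved dependency tree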

Related

Install package with pip that has additional qualifiers

Our CI pipeline publishes wheels for branches that have a version of
<base version>-dev<timestamp>+<branch name>.p<pipeline id>
So if I am working on cool-stuff in the xyzzy branch, it might upload a wheel for version 1.2.3.dev202211221111+xyzzy.p1234.
Somebody else, working in the foobar branch, might cause 1.2.3.dev202211221115+foobar.p1235 to be created.
How can I get pip to install the latest version from the xyzzy branch? I tried pip install cool-stuff>1.2.3.dev*+xyzzy, but it complained that it could not find a matching version (even though the available versions it listed included a +xyzzy tag).
pip install cool-stuff==1.2.3.dev202211221111+xyzzy.p1234 did work, but I would prefer not to have to update the timestamp and pipeline number each time. I am hoping to put cool-stuff >= <magic> in my config file and just run pip install -e . whenever I need new dependencies.
What format do I need to use here?
As far as I know, it is not possible. I cannot think of a workable solution based on the version string alone. The "local" part of the version string (in other words, the part after the plus sign +) cannot be used to differentiate between two releases the way you intend.
If I were in your situation, I think I would investigate a solution where the CI/CD pipelines generate distributions with a name customized according to the git branch. For example in your case the pipelines should generate wheels for Library-foobar or Library-xyzzy, depending on what branch is currently being worked on (while still keeping the same top-level import names of course). This assumes that you can customize your pipelines and processes deeply enough to support such a workflow.
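A hedged sketch of that branch-named-distribution idea as a CI step, assuming a setup.cfg containing a name = cool-stuff line, the build package installed, GNU sed, and branch names that are valid in a distribution name (all of these are assumptions about your project):
BRANCH=$(git rev-parse --abbrev-ref HEAD)
# Rebrand the distribution per branch before building, e.g. cool-stuff-xyzzy.
sed -i "s/^name = cool-stuff$/name = cool-stuff-${BRANCH}/" setup.cfg
python -m build --wheel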

Mark a pip dependency as explicitly installed

I want to differentiate between packages that I have explicitly installed and packages pulled in as dependencies. You can do that by using the --not-required option:
pip3 list --not-required --format freeze
However, if I have a package that requires, for example, the requests package, then requests will be pulled in automatically when installing via requirements.txt. Explicitly installing requests via pip install requests will not put it in the list of --not-required packages, and not even adding it to the requirements.txt file marks the package as explicitly installed.
It seems that pip will always exclude those sub-dependencies and only print packages that no other package depends on. Is that true? How could I work around that without adding additional dependencies for package management? It seems that there is no such clever built-in option, right?
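One rough workaround without extra tooling, assuming requirements.txt lists only the packages you asked for explicitly: compare the names in that file against everything pip reports as installed; whatever is left over was pulled in as a dependency (name normalization, extras and editable installs are not handled in this sketch).
# bash (uses process substitution); prints installed packages not named in requirements.txt
comm -13 \
  <(sed 's/[ =<>!;#].*//' requirements.txt | tr '[:upper:]' '[:lower:]' | sort) \
  <(pip3 freeze | sed 's/==.*//' | tr '[:upper:]' '[:lower:]' | sort)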

How to enforce bundle install location

I come from a Python and JavaScript background.
When developing a JavaScript project, dependencies are installed in a node_modules directory in the project root.
When developing a Python project, virtualenvwrapper is typically used. In this case, dependencies are installed in a virtual environment, which is located in ~/.virtualenvs/<project_name> by default.
Now I need to use a ruby tool for a project. The tool that appears to be the most promising for a similar setup as described above, is bundler.
However, the default installation location for bundler is system-wide. I consider this to be harmful.
For one of my systems, it will prompt for a password, at which point I can still abort.
However, on my other system I can write into the global Ruby installation. I'm using a Homebrew-installed Ruby here. Bundler will just install dependencies globally.
I know I can specify the installation location by adding --path, but this is easy to forget.
One way to enforce an installation path is by committing .bundle/config. It would just have to contain this:
---
BUNDLE_PATH: "."
However, some googling around shows that it's not advised to commit this file.
What is the recommended way to prevent accidental global installations using bundler?
Who's to say it will be accidental? It really depends on what context you're talking about here. I have my Ruby set up so that bundle install works without requiring sudo, it's all done through rbenv automatically. The same is true with rvm if done as a user-level install.
When it comes to deploying apps and you want to make sure it's deployed correctly, that's where tools like Capistrano come into play: Create a deployment script that will apply the correct procedure every time.
Checking in a .bundle/config is really rude from a dev perspective, just like checking in any other user-specific preferences you might have. It causes no end of conflict with other team members.
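If you still want project-local gems without committing anything, one per-machine alternative is to persist a local Bundler setting once so that plain bundle install keeps reusing it (bundle config set needs a reasonably recent Bundler; vendor/bundle is just a common convention, and it plus .bundle/ belong in your untracked ignores):
bundle config set --local path 'vendor/bundle'   # written to .bundle/config, which stays untracked
bundle install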

Should I commit the yarn.lock file and what is it for?

Yarn creates a yarn.lock file after you perform a yarn install.
Should this be committed to the repository or ignored? What is it for?
Yes, you should check it in, see Migrating from npm
What is it for?
The npm client installs dependencies into the node_modules directory non-deterministically. This means that based on the order dependencies are installed, the structure of a node_modules directory could be different from one person to another. These differences can cause "works on my machine" bugs that take a long time to hunt down.
Yarn resolves these issues around versioning and non-determinism by using lock files and an install algorithm that is deterministic and reliable. These lock files lock the installed dependencies to a specific version and ensure that every install results in the exact same file structure in node_modules across all machines.
Depends on what your project is:
Is your project an application? Then: Yes
Is your project a library? If so: No
A more elaborate description of this can be found in this GitHub issue, where one of the creators of Yarn, for example, says:
The package.json describes the intended versions desired by the original author, while yarn.lock describes the last-known-good configuration for a given application.
Only the yarn.lock file of the top-level project will be used. So unless your project will be used standalone rather than installed into another project, there's no use in committing any yarn.lock file; it will always be up to the package.json file to convey what versions of dependencies the project expects.
I see these are two separate questions in one. Let me answer both.
Should you commit the file into repo?
Yes. As mentioned in ckuijjer's answer, it is recommended in the Migration Guide to include this file in the repo. Read on to understand why you need to do it.
What is yarn.lock?
It is a file that stores the exact dependency versions for your project, together with checksums for each package. This is yarn's way of providing consistency for your dependencies.
To understand why this file is needed, you first need to understand the problem behind the original NPM package.json. When you install a package, NPM stores the range of allowed revisions of a dependency instead of a specific revision (semver). On a later install, NPM will try to fetch the latest version of the dependency within the specified range (i.e. non-breaking patch updates). There are two problems with this approach.
Dependency authors might release patch version updates while in fact introducing a breaking change that will affect your project.
Two developers running npm install at different times may get different sets of dependencies, which can make a bug irreproducible across two supposedly identical environments. This can also cause build stability issues for CI servers, for example.
Yarn, on the other hand, takes the route of maximum predictability. It creates a yarn.lock file to save the exact dependency versions. With that file in place, yarn will use the versions stored in yarn.lock instead of resolving them from package.json. This strategy guarantees that none of the issues described above happen.
yarn.lock is similar to npm-shrinkwrap.json, which can be created by the npm shrinkwrap command. Check this answer explaining the differences between the two files.
You should:
add it to the repository and commit it
use yarn install --frozen-lockfile and NOT yarn install as a default both locally and on CI build servers.
(I opened a ticket on yarn's issue tracker to make a case for making frozen-lockfile the default behavior, see #4147).
Be careful NOT to set the frozen-lockfile flag in the .yarnrc file, as that would prevent you from being able to sync the package.json and yarn.lock files. See the related yarn issue on GitHub.
yarn install may mutate your yarn.lock unexpectedly, making yarn's claims of repeatable builds null and void. You should only use yarn install to initialize a yarn.lock and to update it.
Also, especially in larger teams, you may get a lot of noise around changes in yarn.lock only because a developer was setting up their local project.
For further information, read up on my answer about npm's package-lock.json, as that applies here as well.
This was also recently made clear in the docs for yarn install:
yarn install
Install all the dependencies listed within package.json in the local node_modules folder.
The yarn.lock file is utilized as follows:
If yarn.lock is present and is enough to satisfy all the dependencies listed in package.json, the exact versions recorded in yarn.lock are installed, and yarn.lock will be unchanged. Yarn will not check for newer versions.
If yarn.lock is absent, or is not enough to satisfy all the dependencies listed in package.json (for example, if you manually add a dependency to package.json), Yarn looks for the newest versions available that satisfy the constraints in package.json. The results are written to yarn.lock.
If you want to ensure yarn.lock is not updated, use --frozen-lockfile.
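In practice, a typical split looks roughly like this (--frozen-lockfile is the classic Yarn 1.x flag; Yarn 2+/Berry uses yarn install --immutable instead):
yarn install                     # locally, when you actually intend to update yarn.lock
yarn install --frozen-lockfile   # on CI, to fail instead of silently rewriting yarn.lock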
From my experience, I would say yes, we should commit the yarn.lock file. It ensures that when other people use your project, they get the same dependencies your project expects.
From the docs:
When you run either yarn or yarn add, Yarn will generate a yarn.lock file within the root directory of your package. You don’t need to read or understand this file - just check it into source control. When other people start using Yarn instead of npm, the yarn.lock file will ensure that they get precisely the same dependencies as you have.
One argument could be that we can achieve the same thing by dropping the ^ notation and pinning exact versions. Yes, we can, but in general the majority of npm packages come with the ^ notation, and we would have to change the notation manually to ensure static dependency versions. Using yarn.lock instead programmatically ensures the correct versions.
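For comparison, pinning at add time is a single flag in Yarn 1.x, but it only pins your direct dependencies and still leaves transitive versions to the resolver, which is exactly the gap yarn.lock covers (some-package is a placeholder name):
yarn add some-package --exact   # writes an exact version (e.g. "1.2.3") instead of a range ("^1.2.3")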
Also as Eric Elliott said here
Don’t .gitignore yarn.lock. It is there to ensure deterministic dependency resolution to avoid “works on my machine” bugs.
Not to play the devil's advocate, but I have slowly (over the years) come around to the idea that you should NOT commit the lock files.
I know every bit of documentation they have says that you should. But what good can it possibly do?! And the downsides far outweigh the benefits, in my opinion.
Basically, I have spent countless hours debugging issues that have eventually been solved by deleting lock files. For example, the lock files can contain information about which package registry to use, and in an enterprise environment where different users access different registries, it's a recipe for disaster.
Additionally, the lock files can really mess up your dependency tree. Because yarn and npm create a complex tree and keep external modules of different versions in different places (e.g. in a node_modules folder nested within a module inside the top-level node_modules folder of your app), updating dependencies frequently can create a real mess. Again, I have spent tons of time trying to figure out why an old version of a module was still being used by a dependency even though the module had been updated, only to find that deleting the lock file and the node_modules folder solved all the hard-to-diagnose problems.
I even have shell aliases now that delete the lock files (and sometimes node_modules folders as well!) before running yarn or npm.
Just the other side of the coin, I guess, but blindly following this dogma can cost you........
I'd guess yes, since Yarn versions its own yarn.lock file:
https://github.com/yarnpkg/yarn
It's used for deterministic package dependency resolution.
Yes! yarn.lock must be checked in so that any developer who installs the dependencies gets the exact same output! With npm [as it was available in Oct 2016], for instance, you can have a patch version (say 1.2.0) installed locally while a new developer running a fresh install might get a different version (1.2.1).
Yes, you should commit it. For more about the yarn.lock file, refer to the official docs here.

Easiest way to install Python dependencies on Spark executor nodes?

I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?
If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?
Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark, it seems the assumption is that you'll go back to manual dependency management or something.
I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is
Create a virtualenv purely for your Spark nodes
Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies
Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have
Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files
Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:
#!/usr/bin/env bash
# Helper script to fulfil Spark's Python packaging requirements.
# Installs everything into a designated virtualenv, then zips up that virtualenv's
# site-packages for use as the value supplied to the --py-files argument of
# `pyspark` or `spark-submit`.
# First argument: the top-level virtualenv
# Second argument: the zip file which will be created, and which you can
# subsequently supply as the --py-files argument to spark-submit
# Subsequent arguments: all the private packages you wish to install
# (if these are set up with setuptools, their dependencies will be installed too)
VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*
. "$VENV/bin/activate"
for pkg in $PACKAGES; do
    pip install --upgrade "$pkg"
done
# Absolute path; a random number avoids clashes with other processes.
TMPZIP="${TMPDIR:-/tmp}/$RANDOM.zip"
# Adjust python2.7 to whatever interpreter version the virtualenv uses.
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r "$TMPZIP" . )
mv "$TMPZIP" "$ZIPFILE"
I have a collection of other simple wrapper scripts I run to submit my spark jobs. I simply call this script first as part of that process and make sure that the second argument (name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small scale project.
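As a hedged usage example (the script and file names below are placeholders, not anything from the answer itself), the end-to-end call from one of those wrapper scripts might look like:
./make_pyfiles_zip.sh ./venv deps.zip ./my_private_pkg ./my_other_pkg
spark-submit --py-files deps.zip my_job.py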
There are loads of improvements that could be made – e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing often-changing private packages and one containing rarely changing dependencies, which don't need to be rebuilt so often. You could be smarter about checking for file changes before rebuilding the zip. Checking the validity of arguments would also be a good idea. However, for now this suffices for my purposes.
The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions and your driver node has a different architecture from your cluster nodes.
I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.
