Easiest way to install Python dependencies on Spark executor nodes?

I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?
If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?

Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip for installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark, it seems the assumption is you'll go back to manual dependency management or something.
I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is
Create a virtualenv purely for your Spark nodes
Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies
Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have
Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files
Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:
#!/usr/bin/env bash
# Helper script to fulfil Spark's Python packaging requirements.
# Installs everything in a designated virtualenv, then zips up that
# virtualenv's site-packages for use as the value of the --py-files
# argument of `pyspark` or `spark-submit`.
# First argument is the top-level virtualenv.
# Second argument is the zipfile which will be created, and which you
# can subsequently supply as the --py-files argument to spark-submit.
# Subsequent arguments are all the private packages you wish to install.
# If these are set up with setuptools, their dependencies will be installed.
VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*
. "$VENV/bin/activate"
for pkg in $PACKAGES; do
    pip install --upgrade "$pkg"
done
# Build the zip at an absolute temp path; the random number avoids clashes
# with other processes. Fall back to /tmp if TMPDIR is unset.
TMPZIP="${TMPDIR:-/tmp}/$RANDOM.zip"
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r "$TMPZIP" . )
mv "$TMPZIP" "$ZIPFILE"
I have a collection of other simple wrapper scripts I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.
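For illustration, a typical invocation might look like this (the script name, library paths, and job file here are hypothetical):
./build_pyfiles.sh ~/venvs/spark-deps deps.zip ./mylib ./myotherlib    # package two in-house libraries into deps.zip
spark-submit --py-files deps.zip my_job.py                             # run the job against the freshly built zip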
There are loads of improvements that could be made – e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing often-changing private packages and one containing rarely-changing dependencies, which don't need to be rebuilt so often. You could be smarter about checking for file changes before rebuilding the zip. Checking the validity of arguments would also be a good idea. However, for now this suffices for my purposes.
The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.
I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes, since it already includes NumPy (and many other packages), and that might be the better way to get NumPy and other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and you may not control your Spark environment enough to put Anaconda on it, so I think this virtualenv-based approach is still helpful.

Related

Install package with pip that has additional qualifiers

Our CI pipeline publishes wheels for branches that have a version of
<base version>-dev<timestamp>+<branch name>.p<pipeline id>
So if I am working on cool-stuff in the xyzzy branch, it might upload a wheel for version 1.2.3.dev202211221111+xyzzy.p1234.
Somebody else, working in the foobar branch, might cause 1.2.3.dev202211221115+foobar.p1235 to be created.
How can I get pip to install the latest version from the xyzzy branch? I tried pip install cool-stuff>1.2.3.dev*+xyzzy but it complained that it could not find a matching version (even though the available versions that it listed included a +xyzzy tag.)
pip install cool-stuff==1.2.3.dev202211221111+xyzzy.p1234 did work, but I would prefer not to have to update the timestamp and pipeline number each time. I am hoping to put cool-stuff >= <magic> in my config file and just run pip install -e . whenever I need new dependencies.
What format do I need to use here?
As far as I know, it is not possible. I cannot think of a workable solution based on the version string alone. The "local" part of the version string (in other words, the part after the plus sign +) cannot be used to differentiate between two releases as you intend: PEP 440 only permits local version labels in exact comparisons (== and !=), not in ordered specifiers such as >=.
If I were in your situation, I think I would investigate a solution where the CI/CD pipelines generate distributions with a name customized according to the git branch. For example in your case the pipelines should generate wheels for cool-stuff-foobar or cool-stuff-xyzzy, depending on what branch is currently being worked on (while still keeping the same top-level import names, of course). This assumes that you can customize your pipelines and processes deeply enough to support such a workflow.
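For what it's worth, a hedged sketch of what PEP 440 does and doesn't allow here (the per-branch package name in the last line is hypothetical, per the suggestion above):
pip install "cool-stuff==1.2.3.dev202211221111+xyzzy.p1234"   # exact pin including the local segment: works
pip install "cool-stuff>=1.2.3.dev0+xyzzy"                    # ordered specifier with a local segment: rejected as invalid
pip install "cool-stuff-xyzzy>=1.2.3.dev0"                    # per-branch name: ordinary specifiers work again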

Automate creation of a pyenv through sh scripts

I'm running a project that uses pip and a requirements.txt file to install and keep track of some dependencies. I want to write some sh scripts to run, build and test the application. For starters I would like a way to check if the current folder is in a pyenv and, if not, create one to enclose the application and not mess with other people's dependencies. Also, I would like an opinion on the best way to keep track of these dependencies: whether requirements.txt is a good approach, and whether there's a way to keep track of installed versions the way it happens with node packages.
Use Pipenv. It's a better way of tracking your dependencies than requirements.txt, and it uses pyenv to automatically install your project's required Python version.
From the website:
The problems that Pipenv seeks to solve are multi-faceted:
You no longer need to use pip and virtualenv separately. They work together.
Managing a requirements.txt file can be problematic, so Pipenv uses Pipfile and Pipfile.lock to separate abstract dependency declarations from the last tested combination.
Hashes are used everywhere, always. Security. Automatically expose security vulnerabilities.
Strongly encourage the use of the latest versions of dependencies to minimize security risks arising from outdated components.
Give you insight into your dependency graph (e.g. $ pipenv graph).
Streamline development workflow by loading .env files.
[...]
Pipenv Features
Enables truly deterministic builds, while easily specifying only what you want.
Generates and checks file hashes for locked dependencies.
Automatically install required Pythons, if pyenv is available.
Automatically finds your project home, recursively, by looking for a Pipfile.
Automatically generates a Pipfile, if one doesn’t exist.
Automatically creates a virtualenv in a standard location.
Automatically adds/removes packages to a Pipfile when they are installed or uninstalled.
Automatically loads .env files, if they exist.
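To make that concrete, a minimal sketch of the usual Pipenv workflow (the project and package names are just examples):
pip install --user pipenv    # install Pipenv itself
cd myproject                 # your application directory
pipenv install requests      # creates Pipfile and Pipfile.lock, plus a virtualenv
pipenv graph                 # inspect the resulting dependency graph
pipenv shell                 # spawn a shell inside the virtualenv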

Nix: installing ssreflect

I am using Coq (versions 8.5-6), installed w/ Nix. I want to install ssreflect, preferably also w/ Nix. The only information I found about this is here. However, this is not about installing ssreflect, merely trying it out. Nevertheless, I tried to try it out, but ended up w/ hundreds of warnings (about the contents of various .v and .ml4 files) and couldn't wait for the process to end. A fairly typical warning looked like this:
File "./algebra/ssralg.v", line 856, characters 0-39: Warning:
Implicit Arguments is deprecated; use Arguments instead
So the question: How on earth do I install ssreflect w/ Nix?
EDIT: After reading ejgallego's comments, it seems it may be impossible to install ssreflect w/ Nix -- esp. if one wants to install only ssreflect w/out the other modules (fingroup, algebra, etc.). So I also have the following question:
Would the standard Opam or make install installation of ssreflect work w/ a Nix-installed Coq?
There are a few things that you need to be aware of:
Nix is a source-based package manager with a binary cache. A lot of packages are pre-built and available in the binary cache, so installing them doesn't take long; some packages (in particular development libraries) are not pre-built, and Nix will take the time it needs to compile them when installing. Please be patient: you will only need to wait for the full compilation the first time (and yes, math-comp emits lots of warnings upon compilation); after that, the package will already be available in your local Nix store.
Since OPAM is also source-based, using OPAM instead of Nix won't save you time. You can't mix a Nix-installed Coq with an OPAM-installed SSReflect, because the latter will want the former as an OPAM dependency.
The Nix way to use libraries is not to install them but to load them with nix-shell instead. nix-shell will "install" the libraries and set some environment variables for you (e.g. $COQPATH in this case).
You can also compile the package from source yourself using a Nix-installed Coq, but you cannot run make install, because this would try to install SSReflect in the same place where Coq is installed, and the Nix store is immutable. Instead you could skip this step and set up $COQPATH manually.
Indeed, compiling the full math-comp takes a very long time. There is a lighter Coq ssreflect package. You can get it using:
nix-shell -p coqPackages_8_6.ssreflect
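Once inside the shell you can check what it set up for you; a quick sketch (assuming the same 8.6 package set as above):
nix-shell -p coqPackages_8_6.ssreflect --run 'echo $COQPATH'   # should print the path to the ssreflect library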

How do I set up a lambda function in AWS to install certain packages and dependencies

I know there are deployment packages but what are these? Can they contain bash scripts that do apt-get install?
Is there any way to build a lambda function that uses a particular AMI?
You can run shell commands from inside your code (so technically you could run shell scripts), but you are charged for the time it takes your Lambda to execute - so installing a bunch of dependencies every time the Lambda starts up would be considered an anti-pattern.
You need to bundle all the packages and dependencies with your Lambda. This is done by uploading a zip file that contains the lambda function and all the dependencies.
You can see the official docs for the various supported languages here:
http://docs.aws.amazon.com/lambda/latest/dg/nodejs-create-deployment-pkg.html
http://docs.aws.amazon.com/lambda/latest/dg/lambda-java-how-to-create-deployment-package.html
http://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html
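For Python, the packaging those docs describe boils down to something like this sketch (the function and file names are hypothetical):
pip install -r requirements.txt -t build/    # vendor the dependencies into build/
cp lambda_function.py build/                 # add your handler code
(cd build && zip -qr ../function.zip .)      # zip the whole bundle
aws lambda update-function-code --function-name my-func --zip-file fileb://function.zip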
I think you are looking for something like lambda-uploader. You can list out the Python packages required by your lambda. If you come across a package that requires a couple of library files to run, you can include them as well. For example, the mysql-python package requires the libmysqlclient.so and _mysql.so files to run properly.
It generates the .zip file for you and deletes it once it has been uploaded. This way, you can avoid manual packaging steps and make deployment a breeze.

Packaging Go application for Debian

How can I put my Go binary into a Debian package? Since Go is statically linked, I just have a single executable--I don't need a lot of complicated project metadata information. Is there a simple way to package the executable and resource files without going through the trauma of debuild?
I've looked all over for existing questions; however, all of my research turns up questions/answers about a .deb file containing the golang development environment (i.e., what you would get if you do sudo apt-get install golang-go).
Well. I think the only "trauma" of debuild is that it runs lintian after building the package, and it's lintian who tries to spot problems with your package.
So there are two ways to combat the situation:
Do not use debuild: this tool merely calls dpkg-buildpackage, which really does the heavy lifting. The usual call to build a binary package is dpkg-buildpackage -us -uc -b. You still might call debuild for other purposes, like debuild clean for instance.
Add a so-called "lintian override", which can be used to make lintian turn a blind eye to selected problems with your package which, you insist, are not problems - see the sketch below.
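For instance, a statically linked Go binary typically triggers lintian's statically-linked-binary tag; an override file might look like this sketch (the package name is hypothetical):
# debian/mypackage.lintian-overrides
mypackage: statically-linked-binary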
Both approaches imply that you do not attempt to build your application with the packaging tools, but rather treat it as a blob which is just wrapped into a package. This requires deviating slightly from the normal way debian/rules works (so that it does not attempt to build anything).
Another solution which might be possible (and is really way more Debian-ish) is to try to use gcc-go (plus gold for linking): since it's a GCC front-end, this tool produces a dynamically-linked application (which links against libgo or something like this). I, personally, have no experience with it yet, and would only consider using it if you intend to try to push your package into the Debian proper.
Regarding the general question of packaging Go programs for Debian, you might find the following resources useful:
This thread started on go-nuts by one of Go for Debian packagers.
In particular, the first post in that thread links to this discussion on debian-devel.
The second thread on debian-devel regarding that same problem (it's a logical continuation of the former thread).
Update on 2015-10-15.
(Since this post appears to still be searched, found, and studied by people, I've decided to update it to better reflect the current state of affairs.)
Since then, the situation with packaging Go apps and packages has improved dramatically, and it's possible to build a Debian package using "classic" Go (the so-called gc suite originating from Google) rather than gcc-go.
There is now good infrastructure for packages as well.
The key tool to use when debianizing a Go program now is dh-golang described here.
I've just been looking into this myself, and I'm basically there.
Synopsis
By 'borrowing' from the 'package' branch from one of Canonical's existing Go projects, you can build your package with dpkg-buildpackage.
Install dependencies and grab a 'package' branch from another repo.
# I think this list of packages is enough. May need dpkg-dev as well.
sudo apt-get install bzr debhelper build-essential golang-go
bzr branch lp:~niemeyer/cobzr/package mypackage-build
cd mypackage-build
Edit the metadata.
edit debian/control file (name, version, source). You may need to change the golang-stable dependency to golang-go.
The debian/control file is the manifest. Note the 'build dependencies' (Build-Depends: debhelper (>= 7.0.50~), golang-stable) and the 3 architectures. Using Ubuntu (without the gophers ppa), I had to change golang-stable to golang-go.
edit debian/rules file (put your package name in place of cobzr).
The debian/rules file is basically a 'make' file, and it shows how the package is built. In this case they are relying heavily on debhelper. Here they set up GOPATH, and invoke 'go install'.
Here's the magic 'go install' line:
cd $(GOPATH)/src && find * -name '*.go' -exec dirname {} \; | xargs -n1 go install
Also update the copyright file, readme, licence, etc.
Put your source inside the src folder. e.g.
git clone https://github.com/yourgithubusername/yourpackagename src/github.com/yourgithubusername/yourpackagename
or, as a second example:
cp .../yourpackage/ src/
build the package
# -us -uc skips package signing.
dpkg-buildpackage -us -uc
This should produce a binary .deb file for your architecture, plus the 'source deb' (.tgz) and the source deb description file (.dsc).
More details
So, I realised that Canonical (the Ubuntu people) are using Go, and building .deb packages for some of their Go projects. Ubuntu is based on Debian, so for the most part the same approach should apply to both distributions (dependency names may vary slightly).
You'll find a few Go-based packages in Ubuntu's Launchpad repositories. So far I've found cobzr (git-style branching for bzr) and juju-core (a devops project, being ported from Python).
Both of these projects have both a 'trunk' and a 'package' branch, and you can see the debian/ folder inside the package branch. The 2 most important files here are debian/control and debian/rules - I have linked to 'browse source'.
Finally
Something I haven't covered is cross-compiling your package (to the other 2 architectures of the 3, 386/arm/amd64). Cross-compiling isn't too tricky in go (you need to build the toolchain for each target platform, and then set some ENV vars during 'go build'), and I've been working on a cross-compiler utility myself. Eventually I'll hopefully add .deb support into my utility, but first I need to crystallize this task.
Good luck. If you make any progress then please update my answer or add a comment. Thanks.
Building deb or rpm packages from Go applications is also very easy with fpm.
Grab it from rubygems:
gem install fpm
After building your binary, e.g. foobar, you can package it like this:
fpm -s dir -t deb -n foobar -v 0.0.1 foobar=/usr/bin/
fpm supports all sorts of advanced packaging options.
There is an official Debian policy document describing the packaging procedure for Go: https://go-team.pages.debian.net/packaging.html
For libraries: Use dh-make-golang to create a package skeleton. Name your package with a name derived from the import path, with a -dev suffix, e.g. golang-github-lib-pq-dev. Specify the dependencies on the Depends: line. (These are source dependencies for building, not binary dependencies for running, since Go statically links all source.)
Installing the library package will install its source code to /usr/share/golang/src (possibly, the compiled libraries could go into .../pkg). Building depending Go packages will use the artifacts from those system-wide locations.
For executables: Use dh-golang to create the package. Specify dependencies in Build-Depends: line (see above regarding packaging the dependencies).
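As a sketch, creating a library package skeleton per that policy might look like this (the import path is just an example; older versions of dh-make-golang take the import path without the make subcommand):
dh-make-golang make github.com/lib/pq   # generates a golang-github-lib-pq-dev package skeleton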
I recently discovered https://packager.io/ - I'm quite happy with what they're doing. Maybe open up one of the packages to see what they're doing?
