I am running an Apache Beam pipeline on Dataflow with 3 PyPI packages defined in a requirements.txt file. When I run my pipeline with the option "--requirements_file=requirements.txt", it submits the command below to download the PyPI packages.
python -m pip download --dest /tmp/requirements-cache -r requirements.txt --exists-action i --no-binary :all:
This command takes a huge amount of time to download the packages. I tried running it manually as well; it runs forever.
Why is Apache Beam using the --no-binary :all: option? That is the root cause of the long duration. Am I making a mistake, or is there another way to decrease the pip download time?
This is because the packages need to be installed on the workers, and Beam doesn't want to download binaries specific to whatever machine you happen to be launching the pipeline from.
If you have a lot of dependencies, the best solution is to use custom containers.
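For reference, a minimal sketch of the custom-container route, assuming a Python 3.9 SDK; the Beam image tag, the gcr.io path and the flag name (which differs on older SDK versions) are assumptions you would adapt:

# Dockerfile: start from a Beam SDK image matching your SDK version and bake in the dependencies
FROM apache/beam_python3.9_sdk:2.48.0
COPY requirements.txt .
RUN pip install -r requirements.txt

# build, push, and point the pipeline at the image
docker build -t gcr.io/YOUR_PROJECT/beam-worker:latest .
docker push gcr.io/YOUR_PROJECT/beam-worker:latest
python my_pipeline.py --runner=DataflowRunner --sdk_container_image=gcr.io/YOUR_PROJECT/beam-worker:latest

Because the dependencies are already in the image, you can usually drop --requirements_file entirely and skip the slow pip download step.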
Related
When developing a Lambda using the SAM CLI, I often need to run sam build, which creates a new image from my Dockerfile.
This process includes fetching all the dependencies:
RUN python3.8 -m pip install -r requirements.txt -t .
and it takes a lot of time. I'm looking for a way to avoid this step, but couldn't find any info about it.
I figured out this solution: run sam build, then use the generated image as a base for the next run (and remove the pip install command).
It is working, but I was wondering if it is good practice or could it cause problems later?
Any other solutions?
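For reference, a minimal sketch of that approach; the previous-build:latest tag and app.py are placeholders, not anything SAM produces for you automatically:

# Dockerfile for subsequent builds: reuse the image from the first sam build as the base,
# so the pip install layer is already baked in and only the changed code is copied
FROM previous-build:latest
COPY app.py ./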
I read that on OS X you can install Yarn by any of the following:
curl -o- -L https://yarnpkg.com/install.sh | bash
brew install yarn
npm i -g yarn
What functional difference is there between these three methods? Why would someone choose one over the others?
When using brew to install packages, you install them system-wide; that is, you cannot have more than one version of the same package, which is usually problematic. For this reason, many other technologies have spawned, such as Docker and snap.
Moreover, each package manager has its own lifecycle and packages the original software in a different manner for ease of use, distribution and maintenance. For instance, the npm package tracks the releases published by the project itself.
Usually, you should stick to the package manager of the ecosystem that you are using. In your specific case, the recommendation would be to use npm to install and update your package (using package.json). That lets each of your projects pin and lock the Yarn version it wants, without affecting anything system-wide.
Speaking of npm, you might wish to look at this answer.
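As a concrete illustration of the per-project pinning mentioned above (the version number is only an example):

# pin Yarn as a per-project dev dependency, recorded in package.json
npm install --save-dev yarn@1.22.19
# run the project-local copy rather than a global install
npx yarn install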
curl downloads the installation script from yarnpkg.com and installs Yarn using that script.
brew is a package manager for macOS. It's meant to make it easier to install command-line tools. When you install with brew, the package gets put into /usr/local/bin instead of /usr/bin, so I believe this is kind of like a virtual environment and Yarn wouldn't be installed into the core of your machine. You'll have to install Homebrew before you can use it, and you install Homebrew itself using curl. I believe there is less risk when using Homebrew for the same reason: it is kind of like a virtual environment.
npm is a package manager for JavaScript, the same as Yarn. It's meant for easy installation of JavaScript packages.
I use brew for all installs of terminal tools, and npm for all installs of JavaScript packages.
I want to use the package "Dask", but there is one problem.
"Dask dataframe requirements are not installed."
Obviously, we can use pip install "dask[dataframe]" or pip install "dask[complete]".
However, in the secured server where I work, there is no internet connection.
So, I transfer the package files and install them manually.
But I cannot find a dask[dataframe] package to download.
How can I install the rest of packages manually without internet connection?
Thank you
You should look at setup.py in the dask repository to see which dependencies each extra (such as dataframe) requires.
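If you have access to an internet-connected machine with the same Python version and platform as the secured server, a hedged sketch of the usual offline workflow is:

# on the connected machine: download dask plus the dataframe extra and all dependencies
pip download "dask[dataframe]" -d ./dask_packages
# transfer the dask_packages folder to the secured server, then install from it only
pip install --no-index --find-links=./dask_packages "dask[dataframe]"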
Say you regularly use a large Python dependency like tensorflow, but you want to create siloed virtual environments for each separate project.
If I download and install tensorflow to my system using pip, is there a way to tell pipenv to use the previously downloaded dependency instead of doing a slow and high-bandwidth re-download per virtual environment I set up?
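Not a definitive answer, but note that pipenv drives pip under the hood and pip keeps a local wheel cache, so the big download should normally only happen once even though each virtualenv still gets its own install. A sketch, with the cache path just an example:

# point pip (and therefore pipenv) at an explicit shared cache directory
export PIP_CACHE_DIR=~/.cache/pip
pipenv install tensorflow   # later installs in other projects reuse the cached wheel instead of re-downloading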
I would like to create a Conda environment from a .yaml file on an offline machine (i.e. no Internet access). On an online machine this works perfectly fine:
conda env create -f environment.yaml
However, it doesn't work on an offline machine as the packages are then not found. How do I do this?
If that's not possible, is there another easy way to get my complete Conda environment to an offline machine (including both conda- and pip-installed packages)?
Going through the packages one by one to install them from the .tar.bz2 files works, but it is quite cumbersome, so I would like to avoid that.
If you can use pip to install the packages, you should take a look at devpi, particularly its server. devpi can cache packages normally installed from PyPI, so it only actually retrieves them on the first install. You have to configure pip to retrieve the packages from the devpi server.
As you don't want to list all the packages and their dependencies by hand, you should, on a machine connected to the internet:
install the devpi server (I have that running in a Docker container)
run your installation
examine the devpi repository and gather all the .tar.bz2 and .whl files out of there (you might be able to tar the whole thing)
On the non-connected machine:
Install the devpi server and client
use the devpi client to upload all the packages you gathered (using devpi upload) to the devpi server
make sure you have pip configured to look at the devpi server
run pip; it will find all the packages on the local server.
devpi has a small learning curve, which is already worth traversing because of the speed-up and the ability to install private packages (i.e. ones not uploaded to PyPI) as normal dependencies, by just generating the package and uploading it to your local devpi server.
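A rough sketch of the command sequence; the port, index name and directory are assumptions, and the exact commands may differ between devpi versions:

# on the connected machine: run a caching devpi server and install through it
pip install devpi-server devpi-client
devpi-init
devpi-server --port 3141 &
pip install --index-url http://localhost:3141/root/pypi/+simple/ -r requirements.txt
# the downloaded files now sit in the server's cache; gather them from there

# on the offline machine: upload the gathered files and point pip at the local index
devpi use http://localhost:3141
devpi login root --password=''
devpi index -c offline
devpi use root/offline
devpi upload --from-dir ./gathered_packages
pip install --index-url http://localhost:3141/root/offline/+simple/ -r requirements.txt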
I guess that Anthon's solution above is pretty good, but just in case anybody is interested in an easy solution that worked for me:
I first created a .yaml file specifying the environment using conda env export > file.yaml. Then, following the instructions on http://support.esri.com/en/technical-article/000014951, I:
automatically downloaded all the necessary installation files for the conda-installed packages and created a channel from the files (I just adapted the code from the link to work with my .yaml file instead of the conda list file they used)
automatically downloaded the necessary files for the pip-installed packages by looping through the pip entries in the .yaml file and using pip download for each of them
automatically created separate conda and pip requirement lists from the .yaml file
created the environment using conda create with the offline flag, the conda requirements file and my custom channel
finally, installed the pip requirements using pip install with the pip requirements file and the folder containing the pip installation files for the --find-links option
That worked for me. The only problem is that, when you need to specify a different operating system than the one you are running, pip download can only fetch binaries, and for some packages no binaries are available. That was okay for me for now, as the target machine has the same characteristics, but it might be a problem in the future, so I am planning to look into the solution suggested by Anthon.
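A condensed sketch of that workflow; the file names, channel directory and environment name are all assumptions:

# on the connected machine
conda env export > file.yaml
# download the conda packages listed in file.yaml into ./local_channel and index it
# (e.g. with conda index from conda-build), then download the pip packages:
pip download -r pip_requirements.txt -d ./pip_packages

# on the offline machine
conda create --name myenv --offline --channel ./local_channel --file conda_requirements.txt
pip install --no-index --find-links=./pip_packages -r pip_requirements.txt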