Running joblib.Parallel (mlxtend) does not scale in cloud-ml

I'm running a job that uses the mlxtend library, specifically the sequential_feature_selector, which is parallelized using joblib.Parallel (source). When I run the package on my local computer it uses all the available CPUs, but when I send the job to cloud-ml it only uses one core, no matter what number I put in the n_jobs parameter. I've also tried different machine types, but the same thing happens.
Does anybody know what the problem might be?

For anyone who might be interested, we solved the problem by pinning the scikit-learn version in setup.py to 0.20.2. We had scikit-learn in the packages before, but without a version.
# setup.py
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['joblib==0.13.0',
                     'scikit-learn==0.20.2',  # pinned version fixed the single-core issue
                     'mlxtend']

# metadata here is illustrative; use your own project's values
setup(name='trainer',
      packages=find_packages(),
      install_requires=REQUIRED_PACKAGES)
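
For reference, here is a minimal sketch of the kind of call involved (the data and estimator are made up purely for illustration); with the versions pinned above, n_jobs=-1 should fan out across all available cores:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy data purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
sfs = SFS(LogisticRegression(),
          k_features=5,
          forward=True,
          cv=5,
          n_jobs=-1)  # -1 = use every available CPU
sfs = sfs.fit(X, y)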

Related

Deleting stopwords with Gensim

I'm trying to learn Gensim using its site.
There is a function named 'remove_stopword_tokens' which would be useful for my research.
Although the function is defined and present on their website (exact link: link), I can't import it in my Colab notebook.
Note: This is my code:
import gensim
from gensim.parsing.preprocessing import remove_stopword_tokens
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-dbd838c83237> in <module>
----> 1 from gensim.parsing.preprocessing import remove_stopword_tokens
ImportError: cannot import name 'remove_stopword_tokens' from 'gensim.parsing.preprocessing' (/usr/local/lib/python3.7/dist-packages/gensim/parsing/preprocessing.py)
---------------------------------------------------------------------------
Updated & corrected answer
You've run into a limitation of Google Colab - it may not have the most recent version of a library.
You can see this by checking the value of gensim.__version__. In my check of Google Colab right now (September 2022), it reports 3.6.0 – a version of Gensim that's about 4 years old and lacks later fixes & additions. The remove_stopword_tokens() function was only added recently.
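For example, in a Colab cell:

import gensim
print(gensim.__version__)  # e.g. '3.6.0' on an out-of-date Colab image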
Fortunately, you can update the gensim package backing the Colab notebook yourself, using a shell-escape to run pip. Inside a Colab cell, run:
!pip install gensim -U
If you'd already done an import gensim, it will warn you that you must restart the runtime for the new code to be found.
Note that for clarity you might prefer more-specific imports, as many project style guides suggest, rather than doing any broad top-level import gensim at all. Just import the individual classes and/or functions you need, specifically & explicitly. That is, just:
from gensim.parsing.preprocessing import remove_stopword_tokens
# ... other exact class/function/variable imports you'll use...
remove_stopword_tokens(sentence)
On the other hand, if you want things simple-but-sloppy (not recommended), once you import gensim, it has already (via its own custom initialization routines) imported all of its submodules for you. So you could do:
import gensim # parsing & all gensim's other submodules now referenceable!
gensim.parsing.remove_stopword_tokens(sentence)
(Pro Python style tends to avoid this latter approach of prefixing every in-code call with a long dotted path.)

Module not found when running geopandas identity

I'm attempting to conduct a vector identity using two shapefiles in geopandas. When this is run I get the following error.
ModuleNotFoundError: No module named 'geopandas.sindex'
I've done a quick search and there is no module called geopandas.sindex related to vector identity or anything like it. My geopandas is installed under Anaconda, and I know it is installed because I am able to import geopandas.
Here is an excerpt from my code.
import geopandas as gpd
geo_df1 = gpd.read_file("shapefile1.shp")
geo_df2 = gpd.read_file("shapefile2.shp")
geo_df3 = gpd.overlay(geo_df1, geo_df2, how="identity")
I strongly suspect something has gone wrong in my installation, as this operation has worked correctly in the past, so any recommendations for fixing the installation are appreciated. The expected result is being able to run the geopandas operation successfully.

Install numpy in Python 2.7 without setting environment

I want to install numpy for Python 2.7 without setting the environment path. I do not know whether that is possible, but my professor wants it that way, so any advice would be appreciated.
I am not sure I understand your question correctly. You can simply delete Python from your environment path, but normally this is not desirable, since you then cannot call Python from any directory. A better option is to create a virtual environment, or better yet, use Anaconda. This will allow you to use various versions of Python in separate environments without any confusion or clashes between versions. You then install the respective numpy version within a specific environment. See: https://conda.io/docs/user-guide/tasks/manage-python.html
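As a minimal sketch (the environment name py27 is just an example, and this assumes conda is already installed), that could look like:

conda create -n py27 python=2.7 numpy
source activate py27   # newer conda versions use: conda activate py27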
If you mean you want to install numpy but do not have the privileges, then your answer can be found here: (Python) Use a library locally instead of installing it
I hope this helps. If not, then please clarify your question.

Scientific Python installation recommendations

I am new to Python and starting work on a large project that will be distributed to users. I am also the first in my company to use it, and I wanted to get recommendations on the best way to install Python & packages, so that I don't head off in the wrong direction.
I require data analysis frameworks (pandas, numpy, scipy, matplotlib, statsmodels, pymongo) and my initial approach was to install Python 3.5 directly, and then use pip install on each package.
I ran into similar problems to those others have found [Unable to find vcvarsall], and resolved them. The next problem was with BLAS and LAPACK missing when installing scipy. At this point I decided Anaconda was the way to go rather than individual pip installs, and was easily able to set everything up.
One problem with Anaconda is that it installs a lot of packages that I will never use, and may lack some I would like to use in the future, e.g. TensorFlow (presumably I can pip install extra ones that are not included?).
An in-between solution seems to be Miniconda, which I believe would have fixed the BLAS/LAPACK problem with scipy.
So my question is: can someone with experience of developing data analysis projects in Python, that will be deployed to users' Windows desktops, and with server-side components running on Linux, provide recommendation of what they would do if starting from scratch at new organization?
(I'm currently in favour of heading down the Anaconda route.)
Personally, I think Anaconda (conda) is better. First of all, conda is a cross-platform package manager, and it is easy to install and use. Second, conda has the functionality of virtualenv, and you can use conda create to create environments. Finally, there are Anaconda Cloud and conda-forge; those communities can help you solve conda issues, build packages, and share ideas.
Moreover, Anaconda (conda) does indeed install a lot of packages, but those are all dependencies. For example, when you conda install scikit-learn, conda will automatically install the dependencies, numpy and scipy.
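A minimal sketch of that workflow (the environment name analysis is just an example):

conda create -n analysis python=3.5
source activate analysis      # "conda activate analysis" on newer conda
conda install scikit-learn    # conda pulls in numpy, scipy, etc. automatically
pip install tensorflow        # pip still works inside the env for packages conda lacks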

OpenMDAO: First Steps

I am new to the world of OpenMDAO (and also to Python) and I am having some problems understanding the use of the software. I have already installed Anaconda (Python v2.7) and OpenMDAO, but I don't know how to run it. I am following this tutorial, but I am not sure if I am doing it properly. I write the .py files in Notepad++ and try to run them in IPython, but when I use the command from paraboloid import Paraboloid I get an error: No module named.api. I think that maybe I am not using the correct path (I'm in the folder where I have the .py files). It's probably a silly mistake, so sorry for the question.
Thank you all, Jose M O
If your tutorial link above is correct, I see that you are using a tutorial for OpenMDAO 0.1.0. That version is 5.5 years old at this time, and is no longer supported. We will be happy to help with your questions, but to get a better foundation, and a much more useful tool, please consider:
Install OpenMDAO 1.5.0 (pip install openmdao or read these installation docs)
Try this paraboloid tutorial instead.
Good luck,
Keith
NOTE: If you installed OpenMDAO 1.x.x but are using the tutorial from 0.1.0, you will have many problems with api imports, as many things have changed since 0.1.0.
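For orientation, here is roughly what the paraboloid looks like in the 1.x API (a paraphrased sketch, not the exact tutorial code; follow the linked tutorial for the authoritative version):

from openmdao.api import Component

class Paraboloid(Component):
    """f(x, y) = (x - 3)^2 + x*y + (y + 4)^2 - 3"""
    def __init__(self):
        super(Paraboloid, self).__init__()
        self.add_param('x', val=0.0)
        self.add_param('y', val=0.0)
        self.add_output('f_xy', val=0.0)

    def solve_nonlinear(self, params, unknowns, resids):
        x = params['x']
        y = params['y']
        unknowns['f_xy'] = (x - 3.0)**2 + x * y + (y + 4.0)**2 - 3.0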
