Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built - parquet

I'm using Python, Parquet, and Spark, and I run into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version, pyarrow=0.17, did not raise this error; the error does not appear with pyarrow=1.0.1 but does appear with pyarrow=2.0.0. The goal is to write a pandas DataFrame as a Parquet dataset (on Windows) using Snappy compression, and later to process the Parquet dataset with Spark.
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Small example frame; 'x' will be the partition column
df = pd.DataFrame({
    'x': [0, 0, 0, 1, 1, 1],
    'a': np.random.random(6),
    'b': np.random.random(6)})

# Convert to an Arrow table and write a partitioned dataset
# (snappy is pyarrow's default Parquet compression)
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path=r'c:/data', partition_cols=['x'], flavor='spark')

Something was wrong with the conda-installed pyarrow. I removed it with conda remove pyarrow and then installed it with pip install pyarrow. This ended up working.

The pyarrow package you had installed did not come from conda-forge, and it does not appear to match the package on PyPI. I did a bit more research: pypi_0 just means the package was installed via pip; it does not mean it actually came from PyPI.
I'm not really sure how this happened. You could check your conda log (envs/YOUR-ENV/conda-meta/history), but given that this was installed outside of conda, I'm not sure there will be any meaningful information in there. Perhaps you tried to install Arrow after the version was bumped to 3 and before the wheels were uploaded, so your system fell back to building from source?
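A quick way to cross-check what Python actually imports against what conda thinks is installed; a minimal sketch:
import pyarrow

# Confirm which version and file location is actually imported; compare this
# with what `conda list pyarrow` reports (pip-installed packages show up as
# "pypi" in the Channel column, matching the pypi_0 build string mentioned above).
print(pyarrow.__version__)
print(pyarrow.__file__)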

I had the exact same issue. I did a fresh install of Anaconda 3.8, then ran conda install -c conda-forge pyarrow per this link: https://anaconda.org/conda-forge/pyarrow. The install chokes: it fails with the frozen/flexible solve, and conda keeps trying different variants until it finally installs. You can then import pyarrow, but when you try to open a Parquet file you get the 'snappy' codec error, the subject of this thread.
I then did conda remove pyarrow so I was back to a clean install. Then pip install pyarrow, and I could successfully load the Parquet file.

I managed to get it to work by doing pip install pyarrow from the Conda prompt.

I'm not 100% sure, but it could be because since version 1.0.0 they slimmed down the default Arrow build, and Snappy became an optional component.
I think you would have to rebuild Arrow with -DARROW_WITH_SNAPPY=ON, but this can be quite difficult and tedious to get to work.
Another option would be to disable snappy:
pq.write_to_dataset(table, root_path=r'c:/data', partition_cols=['x'], flavor='spark', compression="NONE")
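Before giving up on compression entirely, you can check at runtime whether your pyarrow build was compiled with Snappy support; recent pyarrow versions expose this via pa.Codec.is_available. A minimal sketch:
# Fall back to uncompressed files only when snappy is missing from this build
compression = 'snappy' if pa.Codec.is_available('snappy') else 'NONE'
pq.write_to_dataset(table, root_path=r'c:/data', partition_cols=['x'],
                    flavor='spark', compression=compression)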

Related

No available image handler could decode this transfer syntax JPEG Lossless when reading DICOM and plotting with matplotlib

When I use pydicom in Python 3.6, there is a problem:
import pydicom
import matplotlib.pyplot as plt
import os
import pylab
filePath = "/Users/zhuangrui/Documents/Python/Dicom/dicoms/zhang_bo/0001.dcm"
dataSet_1 = pydicom.dcmread(filePath)
plt.imshow(dataSet_1.pixel_array)
plt.show()
When I run this, I get the error from the title: no available image handler could decode the JPEG Lossless transfer syntax.
How can this problem be solved? Thank you very much!
I faced the same problem and, after doing some research on the link suggested above, I managed to solve it by updating to the latest pydicom module (1.2.0) and installing gdcm. You can update pydicom with
pip install -U git+https://github.com/pydicom/pydicom.git
You can find the latest gdcm releases and installation instructions on the gdcm project site.
I use anaconda, and it's easier to install the gdcm package and solve the problem that way. If you use anaconda, just type inside your environment:
conda install pydicom --channel conda-forge
to get pydicom's latest, and:
conda install -c conda-forge gdcm
to get gdcm. This resolves the problem. Hope this helps.
With pydicom, you need an appropriate image handler also installed to handle compressed image types.
For JPEG lossless, in theory the following should work: jpeg_ls, gdcm, or Pillow with jpeg plugin. All of these also require Numpy to be installed. See the discussion at https://github.com/pydicom/pydicom/issues/532.
There is also a pull request in progress to add more descriptive error messages for what image handlers are needed for different images.
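In the meantime, to see which handler a given file needs, it helps to print the file's transfer syntax before asking for the pixel data. A minimal sketch, assuming pydicom 1.x (the file path is a placeholder):
import pydicom

ds = pydicom.dcmread("0001.dcm")  # placeholder path
# The transfer syntax says how the pixel data is compressed, and
# therefore which handler (gdcm, jpeg_ls, Pillow, ...) is required.
print(ds.file_meta.TransferSyntaxUID.name)

try:
    pixels = ds.pixel_array
except NotImplementedError as exc:
    # Raised when no installed handler supports this transfer syntax
    print(exc)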
Problem:
I was trying to read medical images with the .dcm extension, but was getting an error on Windows as well as on Ubuntu. I found a solution which works on both machines.
The error I got on Ubuntu is: NotImplementedError: this transfer syntax JPEG 2000 Image Compression (Lossless Only), can not be read because Pillow lacks the jpeg 2000 decoder plugin
(Note: on Windows I was getting a different error, but I am sure it's caused by the same issue, i.e. Pillow does not support the JPEG 2000 format.)
Platform Information:
I am using Python 3.6, Anaconda, and Ubuntu, with 15 GB RAM.
RAM is important:
The solution I applied is the same as Ali explained above, but I want to add that the installation may take time, depending on how much RAM you have. On Ubuntu, where I am using 15 GB RAM on a cloud platform, it took less time; on a local Windows machine with 4 GB RAM it took a lot longer.
Solution
Anaconda is necessary. Why?
Please check the official pydicom docs (https://pydicom.github.io/pydicom/dev/getting_started.html), where it is mentioned: "To install pydicom along with image handlers for compressed pixel data, we encourage you to use Miniconda or Anaconda".
If you are using Ubuntu, open a Terminal directly. If you are using Windows, go to Environments in Anaconda Navigator and start a terminal from there. Execute the following commands:
pip install -U git+https://github.com/pydicom/pydicom.git
conda install pydicom --channel conda-forge
conda install -c conda-forge gdcm
Cross-check:
Now use the .dcm file for which we got the error, and try the following code in a Python notebook:
import pydicom
import matplotlib.pyplot as plt

filename = 'FileName.dcm'
ds = pydicom.dcmread(filename)
plt.imshow(ds.pixel_array, cmap=plt.cm.bone)
It should display the image. Also try this:
ds.pixel_array
This will give you the array containing the pixel values.

Clean install of Scientific Python without reformatting the disk

First, the disclaimer: I looked at numerous questions here (about uninstalling everything pip-installed, uninstalling matplotlib, moving from 2.7 to 3.5), but I didn't find an answer to my specific problem. Sorry if I didn't look hard enough.
Basically, my problem is I have a mess of different packages installed by different means at different times. Manifestations of this are:
1) I can import numpy from python but not from a Jupyter notebook:
------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-5a0bd626bb1d> in <module>()
----> 1 import numpy
ImportError: No module named numpy
2) I cannot import nltk:
...
File "numpy.pxd", line 155, in init sklearn.utils.murmurhash (sklearn/utils/murmurhash.c:5029)
ValueError: numpy.dtype has the wrong size, try recompiling
likely many more.
I recently uninstalled Jupyter and Anaconda and installed Anaconda again; this didn't help.
I cannot uninstall numpy / scipy, although I can use them (?!?):
>:~%python -c 'from numpy.random import rand; print rand()'
0.946167984715
>:~%pip uninstall numpy
Cannot uninstall requirement numpy, not installed
I have two versions of Python:
2.7.11 under /usr/local/bin/ pointing to /usr/local/Cellar/python/2.7.11/bin/python
2.7.10 under /usr/bin/ pointing to
/System/Library/Frameworks/Python.framework/Versions
-- although the default is 2.7.10 (numpy works with it), and I think 2.7.11 was added by Anaconda (incorrectly, because it doesn't see numpy).
Yesterday I uninstalled everything I could think of, then upgraded to a new version of Mac OS (10.12.2 Sierra), and then re-installed Anaconda, all in vain.
I am close to reformatting the disk and starting from scratch.
Is there a better option?
Thank you! and sorry for so many details.
I'm not sure exactly on a Mac, but these are the things to try on Windows, and it is probably similar on a Mac.
Start with a clean install of Anaconda, then try "where python" and "where jupyter" (it is "which" instead of "where" on Linux and macOS). This tells you where the shell finds the executables; both should be subfolders of Anaconda. If they are not, check your PATH variable.
Now start python or jupyter, import sys, and try sys.path. That tells you where Python looks for modules; it should list only Anaconda subfolders.
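A minimal sketch of that check from inside the interpreter (sys.executable shows which Python binary is actually running):
import sys

# Which interpreter is running? It should live under your Anaconda install.
print(sys.executable)

# Where will imports be resolved from? This should list Anaconda subfolders,
# not /System/Library/... or another Python's site-packages.
for p in sys.path:
    print(p)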

Import NLTK : no module NLTK corpus

I have installed NLTK. Here's an image of the installation log.
When I use import nltk I get an error:
"No module named NLTK.corpus"
Here is a screenshot.
What could be the cause?
I think I had the same problem. The steps below download all the packages at once (since the question didn't specify a corpus): start python, import nltk, and start the download; then exit python and upgrade nltk once the download is complete. Modify 'all' to download a specific corpus instead. It took me a while to complete the 'all' download; I separately downloaded framenet_v15 and restarted the 'all' download afterwards.
$ python
>>>import nltk
>>>nltk.download('all')
exit python
$ pip install --upgrade nltk
This can also happen when your own script is named nltk.py, because Python imports it instead of the real library. To fix this, you should rename your file to something else, say nltkXXX.py. Also make sure to remove "nltk.pyc" from your directory if it exists, since this will also be loaded (it's the byte-compiled version of your code). After that, it should work fine.
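One quick way to detect this kind of shadowing is to check where the imported module actually lives; a minimal sketch:
import nltk
# If this prints a path in your working directory (e.g. ./nltk.py or ./nltk.pyc)
# instead of site-packages, your own file is shadowing the library.
print(nltk.__file__)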
If you are using the latest version of Python, try installing nltk using pip and a wheel downloaded from here:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
Then, in a command prompt, point pip at the downloaded wheel file:
pip3 install <downloaded-wheel>.whl
This should install nltk correctly.
After that, check the installation in Python using:
import nltk
and download the required nltk data with:
nltk.download()
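nltk.download() with no arguments opens an interactive downloader; you can also fetch a single corpus non-interactively. A minimal sketch, using stopwords as an arbitrary example corpus:
import nltk

# Fetch one specific corpus instead of using the interactive downloader
nltk.download('stopwords')

from nltk.corpus import stopwords
print(stopwords.words('english')[:5])  # confirms nltk.corpus now resolves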
If you get this type of error (Import NLTK: no module NLTK corpus),
make sure your saved file is not named nltk.py.
Just rename your file (for example, rename nltk.py to example.py) or something else.
I hope it will help you.
Thanks
If you are using the PyCharm IDE, you should install NLTK from the IDE's own tools [File -> Settings -> Project Interpreter -> Install (button '+') -> Install Package].

Installing numpy on mac with pip: "requirements already satisfied" but "No module numpy"

I have Python 2.7.8 on a Mac. Things I did:
sudo easy_install pip - worked.
pip install numpy:
Requirement already satisfied (use --upgrade to upgrade): numpy in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
I also did "pip upgrade numpy" - no luck. What's wrong?
Your problem is a conflict of different Python versions.
I would recommend installing Python and all the packages, such as numpy, scipy, matplotlib, pandas, etc., via Homebrew.
See this tutorial: https://github.com/Homebrew/homebrew/blob/master/share/doc/homebrew/Homebrew-and-Python.md
You can verify which Python you're running with which python or which python3 in Terminal.
This solution is more flexible and, in my opinion, cleaner than using Conda/Miniconda. However, it is also a bit lengthier to install, as you need Xcode and its developer tools installed to build everything.
Could it be that you have multiple versions of python installed? What happens if you run python using the full path like this:
$ /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2
instead of just python2?
In my experience on Mac (and other OS too) it is best to go with Anaconda / Miniconda. This is especially true for packages like NumPy and others from scientific stack.
While Anaconda is a full-blown distribution with about 200 packages, Miniconda is just Python with a few basic libraries. The big advantage is that all packages install as binary. Further, it makes it very simple and stable to install multiple Python versions side by side. For example:
conda create -n py27 python=2.7
creates a new environment with Python 2.7. Activate with:
source activate py27
Now:
conda install numpy
installs NumPy cleanly.
You can do the same for Python 3.5 and switch between environments with source activate.
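Once an environment is active, a quick sanity check that the conda-installed NumPy is the one being imported (a minimal sketch):
import numpy

# Both should point inside the environment created above (e.g. .../envs/py27/...)
print(numpy.__version__)
print(numpy.__file__)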
After jumping from one Stack Overflow answer to another, I found the solution!
My problem was:
numpy was at a different location (actually at the right, expected-to-be location); it was IDLE that was looking in its own default folder for the python2.7 install.
I checked that my numpy works by running this script:
import os
import sys
import pygame

# Insert the system numpy location before importing numpy, so IDLE can find it
sys.path.insert(0, '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python')
import numpy

pygame.init()
print "( using __version__): " + numpy.__version__
print numpy.version.version
user_paths = os.environ['PYTHONPATH']
print(user_paths)
The sys.path insertion adds an additional path for IDLE, so it knows where to look for numpy.
Then I check that numpy was truly imported by printing its version; right now it is 1.8.0rc.
I want to find a way to avoid using this sys.path insertion all the time.
So far so good - for now.
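One way to avoid the per-script insertion (a sketch, not specific to IDLE) is to add the directory to the PYTHONPATH environment variable, or to list it in a .pth file inside site-packages, which Python reads at startup. The snippet below just prints the candidate site-packages locations:
import site

# Any directory listed in a .pth file placed in one of these folders
# is appended to sys.path automatically at interpreter startup.
print(site.getsitepackages())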
I had a similar problem with numpy. However, it was resolved by choosing the right environment. If you are using VS Code, open the command palette (Ctrl+Shift+P) and type
Python: Select Interpreter.
From there, try choosing the right virtual environment/Interpreter.

How to import packages/modules into spyder?

I just installed Python on macOS using the Anaconda distribution. My problem is that although the packages (e.g. matplotlib, numpy, scipy) came included with the installation, I have to import them in Spyder every time, which is tedious, and it's also tiring that I have to keep reminding Spyder of their functionality.
For example, in Windows, I only needed to type in the console:
x=array([...,...,...])
but in my mac it would have to be:
import numpy as py
and then type into the console:
x=py.array([...,...,...])
I do notice that in the Spyder Python console (Windows version) there is a text saying:
Imported NumPy 1.8.1, SciPy 0.13.3, Matplotlib 1.3.1 + guidata 1.6.1, guiqwt 2.3.2
Type "scientific" for more details.
That is probably the reason why I don't have to import anything on Windows: Spyder has already done it.
How do I do the same for Mac?
Thank you
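For reference, the imports implied by that Windows console message are just the standard scientific stack; a minimal sketch of the equivalent manual setup (the np/sp/plt aliases are conventional, not required):
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])  # what the bare 'x=array([...])' relied on implicitly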
