Saving Dask DataFrame with image column to HDF5

I am trying to load images of varying sizes into a Dask DataFrame column and save the DataFrame to an HDF5 file.
Here's what I have tried:
import glob
import dask.dataframe as dd
import pandas as pd
import numpy as np
from skimage.io import imread
dir = '/Users/petioptrv/Downloads/mask'
filenames = glob.glob(dir + '/*.png')[:5]
df = pd.DataFrame({"paths": filenames})
ddf = dd.from_pandas(df, npartitions=2)
ddf['images'] = ddf['paths'].apply(imread, meta=('images', np.uint8))
ddf.to_hdf('test.h5', '/data')
I get the following error message:
...
File "/Users/petioptrv/miniconda3/envs/dask/lib/python3.7/site-packages/pandas/io/pytables.py", line 2214, in set_atom_string
item=item, type=inferred_type
TypeError: Cannot serialize the column [images] because
its data contents are [mixed] object dtype
Essentially, PyTables detects that the column has an object dtype and checks if it's of type str. It's not, so it throws an exception.
I can probably hack it by opening the images into byte-arrays and converting those to strings, but that is far from the ideal scenario.

Try specifying data_columns, as suggested in this issue:
ddf.to_hdf('test.h5', '/data', format = 'table', data_columns = ['images'])
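If that does not help, the byte-string workaround mentioned in the question can be sketched roughly as follows. This is only an illustration, assuming each image is a NumPy array; array_to_bytes and bytes_to_array are hypothetical helper names. np.save and np.load are used so that shape and dtype survive the round trip. Whether the resulting bytes column is accepted by to_hdf is not something this sketch verifies.

```python
import io

import numpy as np


def array_to_bytes(arr):
    """Serialize an array (shape and dtype included) to a bytes object."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def bytes_to_array(blob):
    """Rebuild the array from bytes produced by array_to_bytes."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


# Stand-in for an image returned by imread (hypothetical 3x4 grayscale image).
img = np.arange(12, dtype=np.uint8).reshape(3, 4)
blob = array_to_bytes(img)
restored = bytes_to_array(blob)
print(restored.shape, restored.dtype)
```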

Related

ljspeech Hugging Face examples not working

When trying to run the ljspeech example, I get the following error, even when the model is moved to the only GPU in the system. I am using CUDA 11.7, PyTorch 1.13.1, and Fairseq 0.12.2.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
The code used:
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
import IPython.display as ipd
import torch
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/fastspeech2-en-ljspeech",
    arg_overrides={"vocoder": "hifigan", "fp16": False}
)
model = models[0].to(torch.device('cuda'))
models[0] = model
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator(models, cfg)
text = "Hello, this is a test run."
sample = TTSHubInterface.get_model_input(task, text)
wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
ipd.Audio(wav, rate=rate)

Fiona Driver Error when downloading files via URL

It is simple to test whether you get the error on your side:
import geopandas as gpd
gdf = gpd.read_file('https://hepgis.fhwa.dot.gov/fhwagis/AltFuels_Rounds1-5_2021-05-25.zip')
File "fiona/ogrext.pyx", line 540, in fiona.ogrext.Session.start
File "fiona/_shim.pyx", line 90, in fiona._shim.gdal_open_vector
fiona.errors.DriverError: '/vsimem/6101ab5f23764c15b5fe47aa52a049d6' not recognized as a supported file format.
Interestingly, I have received this error for other URLs recently and thought there was something wrong with those URLs. Now I suspect that something else is going on, since it happens with more than one URL, while some other URLs don't have this issue. The behavior is also intermittent: if I rerun that command, it works maybe 1 time out of 20.
My Fiona version:
fiona 1.8.20 py39hea8b339_1 conda-forge
Any help would be much appreciated.
On investigation, the URL does not return a zip file. As the code below shows, it actually returns an HTML input page:
import geopandas as gpd
import requests, io
from pathlib import Path
from zipfile import ZipFile, BadZipFile
import urllib.parse
import fiona
url = "https://hepgis.fhwa.dot.gov/fhwagis/AltFuels_Rounds1-5_2021-05-25.zip"
try:
    gdf = gpd.read_file(url)
except Exception:
    # Fall back to downloading the file manually so it can be inspected.
    f = Path.cwd().joinpath(urllib.parse.urlparse(url).path.split("/")[-1])
    r = requests.get(url, stream=True, headers={"User-Agent": "XY"})
    with open(f, "wb") as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
    try:
        zfile = ZipFile(f)
        zfile.extractall(f.stem)
    except BadZipFile:
        # Not a zip archive: print what the server actually sent back.
        with open(f) as fh:
            print(fh.read())
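A quicker check is to test the downloaded bytes before handing them to anything else. The sketch below uses only the standard library; looks_like_zip is a hypothetical helper name, and the two payloads are made up in memory to stand in for a real archive and for an HTML page.

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the bytes form a readable ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a tiny in-memory archive to stand in for a real download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("example.txt", "hello")
zip_payload = buf.getvalue()

# An HTML page, standing in for what the server returned instead of a zip.
html_payload = b"<html><body><form>...</form></body></html>"

print(looks_like_zip(zip_payload))
print(looks_like_zip(html_payload))
```

With requests, the same check could be applied to r.content before writing anything to disk.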

Pydotplus, Graphviz error: Program terminated with status: 1. stderr follows: 'C:\Users\En' is not recognized as an internal or external command

from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz
from IPython.display import Image
dot_data = export_graphviz(tree,filled=True,rounded=True,class_names=['Setosa','Versicolor','Virginica'],feature_names=['petal length','petal width'],out_file=None)
graph = graph_from_dot_data(dot_data)
Image(graph.create_png())
Program terminated with status: 1. stderr follows: 'C:\Users\En' is not recognized as an internal or external command,
operable program or batch file.
It seems that it split my username in half. How do I overcome this?
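The message suggests that the path to the Graphviz executable contains a space and reaches the shell unquoted, so everything after the space is treated as a separate argument. A rough sketch of that splitting follows, using Python's shlex in non-POSIX mode as an approximation (it is not an exact model of cmd.exe). The folder name "En Example" and the rest of the path are hypothetical, since the real path is cut off in the error.

```python
import shlex

# Hypothetical path: the real one is truncated in the error message.
unquoted = r'C:\Users\En Example\graphviz\dot.exe -Tpng'
quoted = r'"C:\Users\En Example\graphviz\dot.exe" -Tpng'

# posix=False keeps backslashes and splits on whitespace outside quotes.
unquoted_parts = shlex.split(unquoted, posix=False)
quoted_parts = shlex.split(quoted, posix=False)

print(unquoted_parts[0])  # the "command" ends at the first space
print(quoted_parts[0])    # the quoted path stays in one piece
```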
I am trying out a very similar example, based on an ML how-to book that works with a Taiwan credit card dataset to predict default risk. My setup is as follows:
from six import StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
Then creating the decision tree plot is done in this way:
dot_data = StringIO()
export_graphviz(decision_tree=class_tree,
                out_file=dot_data,
                filled=True,
                rounded=True,
                feature_names=X_train.columns,
                class_names=['pay', 'default'],
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
I think it's all coming from the out_file=dot_data argument, but I cannot figure out where the file path is created and stored, as print(dot_data.getvalue()) did not show any pathname.
In my research I came across sklearn.tree.plot_tree(), which seems to do everything that export_graphviz does. So I took the export_graphviz arguments above and, where matching arguments exist in plot_tree, added them.
I ended up with the following which created the same image as was found in the text:
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(20, 10))
tree.plot_tree(class_tree,
               filled=True, rounded=True,
               feature_names=X_train.columns,
               class_names=['pay', 'default'],
               fontsize=12)
plt.show()

PySpark: How do I solve 'python worker failed to connect back' error when using pyproj package in Pandas UDF? (Converting lat/long to UTM coordinates)

I have a json file with lat/long coordinates, which I try to convert to UTM ("x", "y") in PySpark.
The .json file looks like this:
{"positionmessage":{"latitude": 51.822872161865234,"longitude": 4.905614852905273}}
{"positionmessage":{"latitude": 51.819644927978516, "longitude": 4.961687088012695}}
I read the json file in pyspark and try to convert to UTM ('x', 'y'-coord) in PySpark with the following script:
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DateType, FloatType, TimestampType, DoubleType
from pyspark.sql.functions import *
appName = "PySpark"
master = "local"
file_name = "lat_lon.JSON"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
schema = StructType([
    StructField("positionmessage",
                StructType([
                    StructField('latitude', DoubleType(), True),
                    StructField('longitude', DoubleType(), True),
                ]))])
df = spark.read.schema(schema).json(file_name).select("positionmessage.*")
Until here no problem; the problem arises when I try to convert to UTM coordinates using the pyproj package (which worked in Pandas).
from pyspark.sql.functions import array, pandas_udf, PandasUDFType
from pyproj import Proj
from pandas import Series
# Use the 'pandas_udf' decorator to wrap the function.
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def get_utm(x):
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    return Series([pp(e[0], e[1]) for e in x])
df = df.withColumn('utm', get_utm(array('longitude', 'latitude'))) \
    .selectExpr("*", "utm[0] as X", "utm[1] as Y")
df.show()
I get the error "Python worker failed to connect back", but there does not seem to be a problem with the code itself. What could the problem be?
You can use a plain UDF rather than Pandas UDF:
@udf(returnType=ArrayType(DoubleType()))
def get_utm(long, lat):
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    return pp(long, lat)
result = df.withColumn('utm', get_utm('longitude', 'latitude')).selectExpr("*", "utm[0] as X", "utm[1] as Y")
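Both versions hard-code zone=31. As a sanity check, the standard UTM zone for a longitude can be computed directly: zones are 6 degrees wide and numbered from 1 starting at 180°W. The helper below is a hypothetical sketch (utm_zone is not part of pyproj or PySpark) and it ignores the special-case zones around Norway and Svalbard.

```python
import math


def utm_zone(longitude: float) -> int:
    """Standard UTM zone number (1-60) for a longitude in degrees."""
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    # Zones are 6 degrees wide; longitude 180 is folded into zone 60.
    return min(int(math.floor((longitude + 180.0) / 6.0)) + 1, 60)


# Longitudes taken from the sample JSON in the question.
for lon in (4.905614852905273, 4.961687088012695):
    print(lon, utm_zone(lon))
```

If the data ever spans more than one zone, the zone would need to be computed per row rather than fixed in the UDF.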

Saving or downloading plotly iplot images on Google Colaboratory

I have been attempting to download a plot created with Plotly on Google Colaboratory. So far this is what I have attempted:
I have tried changing
files.download('foo.svg')
to
files.download('foo')
and I still get no results. I navigated to the files on Google Colab and nothing shows up there.
import numpy as np
import pandas as pd
from plotly.offline import iplot
import plotly.graph_objs as go
from google.colab import files
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
    init_notebook_mode(connected=False)
#this actually shows the plot
enable_plotly_in_cell()
N = 500
x = np.linspace(0, 1, N)
y = np.random.randn(N)
df = pd.DataFrame({'x': x, 'y': y})
df.head()
data = [
    go.Scatter(
        x=df['x'], # assign x as the dataframe column 'x'
        y=df['y']
    )
]
iplot(data,image = 'svg', filename = 'foo')
files.download('foo.svg')
This is the error I am getting:
OSError                                   Traceback (most recent call last)
<ipython-input-18-31523eb02a59> in <module>()
29 iplot(data,image = 'svg', filename = 'foo')
30
---> 31 files.download('foo.svg')
32
/usr/local/lib/python2.7/dist-packages/google/colab/files.pyc in download(filename)
140 msg = 'Cannot find file: {}'.format(filename)
141 if _six.PY2:
--> 142 raise OSError(msg)
143 else:
144 raise FileNotFoundError(msg) # pylint: disable=undefined-variable
OSError: Cannot find file: foo.svg
To save vector or raster images (e.g. SVGs or PNGs) from Plotly figures you need to have Kaleido (preferred) or Orca (legacy) installed, which is actually possible using the following commands in Colab:
Kaleido:
!pip install kaleido
Orca:
!pip install "plotly>=4.0.0"
!wget https://github.com/plotly/orca/releases/download/v1.2.1/orca-1.2.1-x86_64.AppImage -O /usr/local/bin/orca
!chmod +x /usr/local/bin/orca
!apt-get install xvfb libgtk2.0-0 libgconf-2-4
Once either of the above is done you can use the following code to make, show and export a figure (using plotly version 4):
import plotly.graph_objects as go
fig = go.Figure(go.Scatter(x=[1, 2, 3], y=[1, 3, 2]))
fig.show()
fig.write_image("image.svg")
fig.write_image("image.png")
The files can then be downloaded with:
from google.colab import files
files.download('image.svg')
files.download('image.png')
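Since the original OSError came from the file never having been written, it can help to check that the export exists before calling files.download. A minimal sketch follows; download_if_present is a hypothetical helper, and its download argument stands in for files.download so that the snippet also runs outside Colab.

```python
import os
import tempfile


def download_if_present(path, download):
    """Call download(path) only if the file exists; return True if it ran."""
    if not os.path.isfile(path):
        print(f"Skipping {path}: file was not written")
        return False
    download(path)
    return True


# Stand-in for google.colab.files.download, which only exists in Colab.
downloaded = []

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "image.svg")
    with open(present, "w") as fh:
        fh.write("<svg></svg>")
    missing = os.path.join(tmp, "foo.svg")
    ok_present = download_if_present(present, downloaded.append)
    ok_missing = download_if_present(missing, downloaded.append)
```

In Colab the call would look like download_if_present('image.svg', files.download).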
Try this; it works for me:
import plotly.graph_objects as go
from google.colab import files
fig = go.Figure(...) # plot your fig
fig.write_html("file.html") # write as HTML (or use write_image for an image)
files.download("file.html") # download the file
