PyArrow: Writing a parquet file with a particular schema

For testing purposes, I am trying to generate a file with dummy data, but with the following schema (schema of the real data):
pa.schema([
    pa.field('field1', pa.int64()),
    pa.field('field2', pa.list_(pa.field('element', pa.int64()))),
    pa.field('field3', pa.list_(pa.field('element', pa.float64()))),
    pa.field('field4', pa.list_(pa.field('element', pa.float64()))),
])
I have the following code:
import pyarrow as pa
import pyarrow.parquet as pq
loc = "test.parquet"
data = {
    "field1": [0],
    "field2": [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
    "field3": [[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]],
    "field4": [[2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9]]
}
schema1 = pa.schema([
    pa.field('field1', pa.int64()),
    pa.field('field2', pa.list_(pa.field('element', pa.int64()))),
    pa.field('field3', pa.list_(pa.field('element', pa.float64()))),
    pa.field('field4', pa.list_(pa.field('element', pa.float64()))),
])
schema2 = pa.schema([
    pa.field('field1', pa.int64()),
    pa.field('field2', pa.list_(pa.int64())),
    pa.field('field3', pa.list_(pa.float64())),
    pa.field('field4', pa.list_(pa.float64())),
])
writer = pq.ParquetWriter(loc, schema1)
writer.write(pa.table(data))
writer.close()
The dictionary in the code, when converted to a PyArrow table and written to a parquet file, generates a file whose schema matches schema2. Passing schema1 to the writer gives an error. How can I change the dictionary in such a way that its schema matches schema1 when converted to a table?

Semantically the schemas are the same; the name of the list item ("element") should not matter. This used to be an issue but has been fixed in pyarrow 11.0.0 (https://issues.apache.org/jira/browse/ARROW-14999), so you can upgrade to pyarrow 11.0.0 or later and it should work.
Alternatively, you can make sure your table has the correct schema, either by building it with the schema:
writer.write(pa.table(data, schema=schema1))
Or by casting it:
writer.write(pa.table(data).cast(schema1))
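For reference, here is a minimal end-to-end sketch of the first option, building the table with the desired schema up front (it reuses schema1 from the question, with shortened data, and writes to test.parquet). Since the table's schema then matches schema1 exactly, the writer should accept it:
import pyarrow as pa
import pyarrow.parquet as pq

schema1 = pa.schema([
    pa.field('field1', pa.int64()),
    pa.field('field2', pa.list_(pa.field('element', pa.int64()))),
    pa.field('field3', pa.list_(pa.field('element', pa.float64()))),
    pa.field('field4', pa.list_(pa.field('element', pa.float64()))),
])
data = {
    "field1": [0],
    "field2": [[0, 1, 2]],       # same shape as the question's data, shortened for brevity
    "field3": [[1.0, 1.1, 1.2]],
    "field4": [[2.0, 2.1, 2.2]],
}

# Build the table with the desired schema so the list items are already named "element"
table = pa.table(data, schema=schema1)
with pq.ParquetWriter("test.parquet", schema1) as writer:
    writer.write(table)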

Related

Is it possible to specify the compression when using pyarrow write_dataset?

I would like to be able to control the type of compression used when partitioning (default is snappy).
import numpy.random
import pyarrow as pa
import pyarrow.dataset as ds

data = pa.table(
    {
        "day": numpy.random.randint(1, 31, size=100),
        "month": numpy.random.randint(1, 12, size=100),
        "year": [2000 + x // 10 for x in range(100)],
    }
)
ds.write_dataset(
    data,
    "./tmp/partitioned",
    format="parquet",
    existing_data_behavior="delete_matching",
    partitioning=ds.partitioning(
        pa.schema(
            [
                ("year", pa.int16()),
            ]
        ),
    ),
)
It is not clear to me from the docs whether that's actually possible.
There is an option to specify the file options:
file_options : pyarrow.dataset.FileWriteOptions, optional
    FileFormat specific write options, created using the FileFormat.make_write_options() function.
You can use any of the compression codecs mentioned in the docs: snappy, gzip, brotli, zstd, lz4, none.
The code below writes the dataset using brotli compression.
import numpy.random
import pyarrow as pa
import pyarrow.dataset as ds

data = pa.table(
    {
        "day": numpy.random.randint(1, 31, size=100),
        "month": numpy.random.randint(1, 12, size=100),
        "year": [2000 + x // 10 for x in range(100)],
    }
)
file_options = ds.ParquetFileFormat().make_write_options(compression='brotli')
ds.write_dataset(
    data,
    "./tmp/partitioned",
    format="parquet",
    existing_data_behavior="delete_matching",
    file_options=file_options,
    partitioning=ds.partitioning(
        pa.schema(
            [
                ("year", pa.int16()),
            ]
        ),
    ),
)
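If you want to double-check which codec actually ended up in the output, one way is to inspect the Parquet metadata of one of the written files. A sketch; the exact file path below is hypothetical and depends on the partitioning layout and basename template:
import pyarrow.parquet as pq

# One of the output files produced by write_dataset above (hypothetical example path)
pf = pq.ParquetFile("./tmp/partitioned/2000/part-0.parquet")
# Compression codec of the first column chunk in the first row group, e.g. 'BROTLI'
print(pf.metadata.row_group(0).column(0).compression)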

Why can't we convert flat columns of awkward1 arrays `to_parquet`?

A follow-up from this question: Best way to save a dict of awkward1 arrays?
To save multiple columns of nested awkward1 arrays (with varying length):
import numpy as np
import awkward1 as ak

dog = ak.from_iter([[1, 2], [5]])
cat = ak.from_iter([[4]])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")
Unfortunately, this doesn't seem to work for flat lists:
import numpy as np
import awkward1 as ak

dog = ak.from_iter([1, 2, 5])
cat = ak.from_iter([4])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")
creates the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-7f3a7fefb261> in <module>
3 cat = ak.from_iter([3])
4 pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
----> 5 ak.to_parquet(pets, "pets.parquet")
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in to_parquet(array, where, explode_records, list_to32, string_to32, bytestring_to32, **options)
2983 layout = to_layout(array, allow_record=False, allow_other=False)
2984 iterator = batch_iterator(layout)
-> 2985 first = next(iterator)
2986
2987 if "schema" not in options:
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in batch_iterator(layout)
2978 )
2979 yield pyarrow.RecordBatch.from_arrays(
-> 2980 pa_arrays, schema=pyarrow.schema(pa_fields)
2981 )
2982
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays()
TypeError: object of type 'pyarrow.lib.Tensor' has no len()
What is the reason for encountering this error?
What you found is a bug, and now it is fixed: https://github.com/scikit-hep/awkward-1.0/pull/799
What's happening here is that pyarrow can't write pyarrow.lib.Tensor (regular-length lists, such as the one you created with np.newaxis) to Parquet files. Parquet files don't have a concept of "regular-length list," so that makes sense. But rather than converting it, pyarrow hits an unhandled case, in which it fails to find the length of that pyarrow.lib.Tensor. (It's a little odd that pyarrow.lib.Tensor doesn't have a __len__ method, but that's another thing.)
Anyway, with version 1.2.0 of Awkward Array, we'll simply convert regular-length lists into (in principle) variable-length lists when writing to Parquet, since the format doesn't have that type. According to the schedule, version 1.2.0 will be released tomorrow. (This bug-fix will likely be in the last prerelease.)
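If you cannot upgrade right away, one possible workaround (a sketch of mine, not part of the original answer) is to build the one-row columns as variable-length lists with ak.from_iter instead of np.newaxis, so pyarrow never sees a regular-length list:
import awkward1 as ak

dog = ak.from_iter([[1, 2, 5]])   # type 1 * var * int64 (variable-length, Parquet-friendly)
cat = ak.from_iter([[4]])
pets = ak.zip({"dog": dog, "cat": cat}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")
This mirrors the nested example at the top of the question, which already writes successfully.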

How to change column datatype with pyarrow

I am reading a set of arrow files and am writing them to a parquet file:
import pathlib
from pyarrow import parquet as pq
from pyarrow import feather
import pyarrow as pa

base_path = pathlib.Path('../mydata')
fields = [
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
]
schema = pa.schema(fields)
with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        pqwriter.write_table(table)
My problem is that the code field in the arrow files is defined with an int8 index instead of int32. The range of int8, however, is insufficient. Hence I defined a schema with an int32 index for the field code in the parquet file.
However, writing the arrow table to parquet now complains that the schemas do not match.
How can I change the datatype of the arrow column? I checked the pyarrow API and did not find a way to change the schema. Can this be done without roundtripping to pandas?
Arrow ChunkedArray has a cast function, but unfortunately it doesn't work for what you want to do:
>>> table['code'].cast(pa.dictionary(pa.int32(), pa.uint64(), ordered=False))
Unsupported cast from dictionary<values=uint64, indices=int8, ordered=0> to dictionary<values=uint64, indices=int32, ordered=0> (no available cast function for target type)
Instead you can cast to pa.uint64() and encode it to dictionary:
>>> table['code'].cast(pa.uint64()).dictionary_encode().type
DictionaryType(dictionary<values=uint64, indices=int32, ordered=0>)
Here's a self-contained example:
import pyarrow as pa

source_schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
])
source_table = pa.Table.from_arrays([
    pa.array([1, 2, 3], pa.int64()),
    pa.array([1, 2, 1000], pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
], schema=source_schema)

destination_schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
])
destination_data = pa.Table.from_arrays([
    source_table['value'],
    source_table['code'].cast(pa.uint64()).dictionary_encode(),
], schema=destination_schema)
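Applied to the original loop, a sketch of the whole conversion (assuming the column really is named 'code' and the rest of the schema matches) could look like this:
import pathlib
import pyarrow as pa
from pyarrow import feather
from pyarrow import parquet as pq

base_path = pathlib.Path('../mydata')
schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
])

with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        # Decode to plain uint64 values, then re-encode; dictionary_encode uses int32 indices by default
        code = table['code'].cast(pa.uint64()).dictionary_encode()
        idx = table.schema.get_field_index('code')
        table = table.set_column(idx, pa.field('code', code.type), code)
        pqwriter.write_table(table)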

Transform a list of files (JSON) to a dataframe

Spark Version: '2.0.0.2.5.0.0-1245'
So, my original question changed a bit but it's still the same issue.
What I want to do is load a huge amount of JSON files and transform those to a DataFrame - also probably save them as CSV or parquet file for further processing. Each JSON file represents one row in the final DataFrame.
import os
import glob
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

HDFS_MOUNT = # ...
DATA_SET_BASE = # ...

schema = StructType([
    StructField("documentId", StringType(), True),
    StructField("group", StringType(), True),
    StructField("text", StringType(), True)
])

# Get the file paths
file_paths = glob.glob(os.path.join(HDFS_MOUNT, DATA_SET_BASE, '**/*.json'))
file_paths = [f.replace(HDFS_MOUNT + '/', '') for f in file_paths]
print('Found {:d} files'.format(len(file_paths)))  # 676 files

sql = SQLContext(sc)  # sc is an existing SparkContext
df = sql.read.json(file_paths, schema=schema)
print('Loaded {:d} rows'.format(df.count()))  # 9660 rows (what !?)
Besides the fact that there are 9660 rows instead of 676 (the number of available files), I also have the problem that the content seems to be None:
df.head(2)[0].asDict()
gives
{
    'documentId': None,
    'group': None,
    'text': None,
}
Example Data
This is just fake data of course but it resembles the actual data.
Note: Some fields may be missing, e.g. text is not always present.
a.json
{
    "documentId": "001",
    "group": "A",
    "category": "indexed_document",
    "linkIDs": ["adiojer", "asdi555", "1337"]
}
b.json
{
    "documentId": "002",
    "group": "B",
    "category": "indexed_document",
    "linkIDs": ["linkId", "1000"],
    "text": "This is the text of this document"
}
Assuming that all your files have the same structure and are in the same directory:
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json')
There might be a problem if any of the columns has null values for all rows. Then Spark will not be able to determine the schema, so you have the option of telling Spark which schema to use:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, LongType

sc = SparkContext(appName="My app")
sql_cntx = SQLContext(sc)
schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", LongType(), True)
])
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json', schema=schema)
UPD: In case the files contain multi-line (pretty-printed) JSON, you can try this code:
sc = SparkContext(appName='Test')
sql_context = SQLContext(sc)
rdd = sc.wholeTextFiles('/tmp/test/*.json').values()
df = sql_context.read.json(rdd, schema=schema)
df.show()
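For what it's worth, if you are not tied to Spark 2.0, the JSON reader gained a multiLine option in Spark 2.2 that handles pretty-printed (multi-line) JSON files directly, so the wholeTextFiles detour is only needed on older versions. A sketch, assuming the same sql_context and schema as above:
df = sql_context.read.json('/hdfs/path/to/folder/*.json', schema=schema, multiLine=True)
df.show()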

Querying on JSON data type in Postgres, Rails 4

I am using Rails 4, have a Product model, and store specifications as a JSON column.
In the migration file I added:
add_column :products, :specifications, :json
A sample product record looks like:
#<Product id: 1, prod_id: 525141, cat_id: 6716, category_id: 5, updated: "2013-09-24 07:37:20", created_at: "2014-03-07 12:21:34", updated_at: "2014-03-07 12:32:36", eans: ["4016032274001"], skus: ["DK-1511-010F/WH"], account_id: 2, specifications: {"network"=>["PCI-Express 2.1 16x", "CardBus", "PCI-Express 3.0 16x", "PCI 64-bit, 66MHz", "PCI 64-bit, 33MHz", "PCI 32-bit, 66MHz", "PCI 3.0", "PCI 2.3", "PCI 2.2", "PCI-X", "PCI-Express 16x", "PCI-Express 8x", "PCI-Express 4x", "PCI-Express 2.0 16x", "PCI-Express 1x", "PCI", "PC Card", "ISA", "AGP 8x", "AGP 4x", "AGP 2x", "AGP 1x"], "rating"=>[4]}>
I want to query on a product's specifications, e.g. get all products whose rating (inside specifications) equals 4.
Is there any gem available to implement this?
With the advent of jsonb in Rails 4.2 with Postgres 9.4, you can do this IF the rating field is a string and not an integer. Below is an example.
Dynamic.create(payload: {"rating"=>['4']})
Dynamic.create(payload: {"rating"=>['3']})
Dynamic.where("payload -> 'rating' ? '4'").count
This will give you the proper records. You will have to update your json field to jsonb. You can do this by following my response here.
You can find out more about how to work with jsonb in Rails 4.2 here.
Yes, there is JSONQuery, which is implemented in Rails via this gem: https://github.com/jcoglan/siren
