How to change column datatype with pyarrow - parquet

I am reading a set of arrow files and am writing them to a parquet file:
import pathlib

from pyarrow import parquet as pq
from pyarrow import feather
import pyarrow as pa

base_path = pathlib.Path('../mydata')

fields = [
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
]
schema = pa.schema(fields)

with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        pqwriter.write_table(table)
My problem is that the code field in the arrow files is defined with an int8 index instead of int32. The range of int8, however, is insufficient, so I defined a schema with an int32 index for the code field in the parquet file.
However, writing the arrow table to parquet now complains that the schemas do not match.
How can I change the datatype of the arrow column? I checked the pyarrow API and did not find a way to change the schema. Can this be done without roundtripping to pandas?

Arrow ChunkedArray has a cast function, but unfortunately it doesn't work for what you want to do:
>>> table['code'].cast(pa.dictionary(pa.int32(), pa.uint64(), ordered=False))
Unsupported cast from dictionary<values=uint64, indices=int8, ordered=0> to dictionary<values=uint64, indices=int32, ordered=0> (no available cast function for target type)
Instead, you can cast to pa.uint64() and dictionary-encode it:
>>> table['code'].cast(pa.uint64()).dictionary_encode().type
DictionaryType(dictionary<values=uint64, indices=int32, ordered=0>)
Here's a self-contained example:
import pyarrow as pa

source_schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
])

source_table = pa.Table.from_arrays([
    pa.array([1, 2, 3], pa.int64()),
    pa.array([1, 2, 1000], pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
], schema=source_schema)

destination_schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
])

destination_data = pa.Table.from_arrays([
    source_table['value'],
    source_table['code'].cast(pa.uint64()).dictionary_encode(),
], schema=destination_schema)
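Applied back to the question's loop, the per-file fix could look like the sketch below. This is only an adaptation under assumptions: it assumes every .arrow file has exactly the value and code columns, and that dictionary_encode() yields int32 indices, as it did in the example above.

import pathlib

import pyarrow as pa
from pyarrow import feather
from pyarrow import parquet as pq

base_path = pathlib.Path('../mydata')

schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
])

with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        # Re-encode 'code': drop the int8-indexed dictionary by casting to
        # plain uint64, then dictionary-encode again to get int32 indices.
        recoded = table['code'].cast(pa.uint64()).dictionary_encode()
        idx = table.schema.get_field_index('code')
        table = table.set_column(idx, pa.field('code', recoded.type), recoded)
        pqwriter.write_table(table)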

PyArrow: Writing a parquet file with a particular schema

For testing purposes, I am trying to generate a file with dummy data, but with the following schema (schema of the real data):
pa.schema([
    pa.field('field1', pa.int64()),
    pa.field('field2', pa.list_(pa.field('element', pa.int64()))),
    pa.field('field3', pa.list_(pa.field('element', pa.float64()))),
    pa.field('field4', pa.list_(pa.field('element', pa.float64()))),
])
I have the following code:
import pyarrow as pa
import pyarrow.parquet as pq

loc = "test.parquet"

data = {
    "field1": [0],
    "field2": [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
    "field3": [[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]],
    "field4": [[2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9]]
}

schema1 = pa.schema([
    pa.field('field1', pa.int64()),
    pa.field('field2', pa.list_(pa.field('element', pa.int64()))),
    pa.field('field3', pa.list_(pa.field('element', pa.float64()))),
    pa.field('field4', pa.list_(pa.field('element', pa.float64()))),
])

schema2 = pa.schema([
    pa.field('field1', pa.int64()),
    pa.field('field2', pa.list_(pa.int64())),
    pa.field('field3', pa.list_(pa.float64())),
    pa.field('field4', pa.list_(pa.float64())),
])

writer = pq.ParquetWriter(loc, schema1)
writer.write(pa.table(data))
writer.close()
The dictionary in the code, when converted to a PyArrow table and written to a parquet file, generates a file whose schema matches schema2. Passing schema1 to the writer gives an error. How can I change the dictionary in such a way that its schema matches schema1 when converted to a table?
Semantically the schemas are the same; the name of the list item ("element") should not matter. This used to be an issue, but it has been fixed in pyarrow 11.0.0 (https://issues.apache.org/jira/browse/ARROW-14999).
So if you upgrade pyarrow, it should work.
Alternatively you can make sure your table has got the correct schema by doing either:
writer.write(pa.table(data, schema=schema1))
Or by casting it:
writer.write(pa.table(data).cast(schema1))
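Putting it together, a minimal end-to-end sketch (reusing data and schema1 from the question) could look like this:

import pyarrow as pa
import pyarrow.parquet as pq

# Building the table against schema1 names the list items 'element',
# so it matches the schema passed to the writer.
table = pa.table(data, schema=schema1)  # or: pa.table(data).cast(schema1)

with pq.ParquetWriter("test.parquet", schema1) as writer:
    writer.write(table)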

How to plot when we have dropdown menu with multiple selections?

I am pretty new to Dash/Plotly. I am trying to create a dropdown menu with multiple selections on. I have a dataframe with the column names ['col1', 'col2', 'col3', 'col4', 'col5']. I want to plot this dataframe with 'col5' on the y-axis and the rest of the columns in the dropdown menu for the x-axis selection. I want these columns added to the x-axis as I select them from the dropdown menu.
I cannot understand what I need to modify to make it work. I checked the published posts but couldn’t figure it out.
I get “Callback error updating mygraph1.figure” when I run my code below:
from dash import Dash, dcc
from dash.dependencies import Input, Output
import dash_bootstrap_components as dbc
import pandas as pd
import numpy as np
import plotly.express as px
from dash.exceptions import PreventUpdate

df = pd.DataFrame(np.random.randn(100, 5))
df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']

app = Dash(external_stylesheets=[dbc.themes.SUPERHERO])

app.layout = dbc.Container([
    dbc.Row([
        dbc.Col([
            dcc.Dropdown(id='mydd1',
                         options=df.columns.values[0:4],
                         multi=True,
                         clearable=True,
                         value=[])
        ], width=4),
    ]),
    dbc.Row([
        dbc.Col([
            dcc.Graph(id='mygraph1', figure={})
        ], width=4),
    ])
], fluid=True)

@app.callback(
    Output('mygraph1', 'figure'),
    Input('mydd1', 'value'),
)
def update_title(X):
    if X is None:
        raise PreventUpdate
    fig1 = px.line(df, x=df[X], y=df['col5'])
    return fig1

if __name__ == "__main__":
    app.run_server(debug=True, port=8055)
The problem is in the value of X: it is a list, so you cannot use it directly as df[X]. With a list, df[X] returns a DataFrame (like df[['col1']]) rather than a single column, which px.line cannot use for x.
Full example:
import dash
import dash_bootstrap_components as dbc
from dash import Dash, html, dcc, dash_table
from dash.dependencies import Input, Output
import plotly.graph_objs as go
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 5))
df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']

app = dash.Dash(external_stylesheets=[dbc.themes.SUPERHERO])

app.layout = dbc.Container([
    dbc.Row([
        dbc.Col([
            dcc.Dropdown(id='mydd1',
                         options=df.columns.values[0:4],
                         multi=True,
                         clearable=True,
                         value=[])
        ], width=4),
    ]),
    dbc.Row([
        dbc.Col([
            dcc.Graph(id='mygraph1', figure={})
        ], width=4),
    ])
], fluid=True)

@app.callback(
    Output('mygraph1', 'figure'),
    Input('mydd1', 'value'),
)
def update_title(X):
    if X == []:
        return dash.no_update
    fig = go.Figure()
    for idx, col in enumerate(X):
        fig.add_trace(go.Scatter(x=df[col], y=df['col5'], mode='lines', name=col))
    return fig

app.run_server(debug=True, use_reloader=False)
Output: (screenshot of the resulting graph omitted)
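If you prefer plotly.express over graph_objs, a roughly equivalent callback (a sketch, not part of the original answer, assuming import plotly.express as px and the df and app defined above) can reshape the selected columns into long form with pandas.melt and let color split the traces, replacing the go.Figure callback:

@app.callback(
    Output('mygraph1', 'figure'),
    Input('mydd1', 'value'),
)
def update_title(X):
    if not X:
        return dash.no_update
    # One row per (selected column, value) pair, keeping 'col5' as the y value.
    melted = df.melt(id_vars='col5', value_vars=X,
                     var_name='column', value_name='x_value')
    return px.line(melted, x='x_value', y='col5', color='column')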

How to solve for pyodbc.ProgrammingError: The second parameter to executemany must not be empty

Hi, I'm having an issue with the transfer of data from one database to another. I created a list using a field in a table on an MSSQL db, used that list to query an Oracle db table (using the initial list in the WHERE clause to filter results), and I then load the query results back into the MSSQL db.
The program runs for the first few iterations but then errors out with the following error:
Traceback (most recent call last):
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 67, in <module>
    insertIntoMSDatabase(idString)
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 48, in insertIntoMSDatabase
    mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
I can't seem to find any guidance online to troubleshoot this error message. I feel it may be a simple solution, but I just can't get there...
# import libraries
import cx_Oracle
import pyodbc
import logging
import time
import re
import math
import numpy as np

logging.basicConfig(level=logging.DEBUG)

conn = pyodbc.connect('''Driver={SQL Server Native Client 11.0};
                         Server='servername';
                         Database='dbname';
                         Trusted_connection=yes;''')
b = conn.cursor()

dsn_tns = cx_Oracle.makedsn('Hostname', 'port', service_name='name')
conn1 = cx_Oracle.connect(user=r'uid', password='pwd', dsn=dsn_tns)
c = conn1.cursor()

beginTime = time.time()

bind = (b.execute('''select distinct field1
                     from [server].[db].[dbo].[table]'''))
print('MSQL table(s) queried, List Generated')

# formats ids for sql string
def surroundWithQuotes(id):
    return "'" + re.sub(",|\s$", "", str(id)) + "'"

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    print('Oracle table(s) queried, Data Pulled')
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename]
             (
             [fields1]
             ,[field2]
             )
             VALUES (?,?)'''
    val = claimsdata
    mycursor.executemany(sql, val)
    conn.commit()

ids = []
formattedIdStrings = []

# adds all the ids found in bind to an iterable array
for row in bind:
    ids.append(row[0])

# splits the ids[] array into multiple arrays < 1000 in length
batchedIds = np.array_split(ids, math.ceil(len(ids) / 1000))

# formats the value inside each batchedId to be a string
for batchedId in batchedIds:
    formattedIdStrings.append(",".join(map(surroundWithQuotes, batchedId)))

# runs insert into MS database for each batch of IDs
for idString in formattedIdStrings:
    insertIntoMSDatabase(idString)
print("MSQL table loaded, Data inserted into destination")

endTime = time.time()
print("Program Time Elapsed: ", endTime - beginTime)

conn.close()
conn1.close()
mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
Before calling .executemany() you need to verify that val is not an empty list (as would be the case if .fetchall() is called on a SELECT statement that returns no rows), e.g.,
if val:
    mycursor.executemany(sql, val)
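Applied to the question's insertIntoMSDatabase function, the guard might look like this sketch, which simply skips any batch whose Oracle query returned no rows:

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    if not claimsdata:
        # Nothing matched this batch of IDs; skip it so executemany()
        # is never called with an empty sequence.
        return
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename]
             ([fields1], [field2])
             VALUES (?, ?)'''
    mycursor.executemany(sql, claimsdata)
    conn.commit()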

Transform a list of files (JSON) to a dataframe

Spark Version: '2.0.0.2.5.0.0-1245'
So, my original question changed a bit but it's still the same issue.
What I want to do is load a huge amount of JSON files and transform those to a DataFrame - also probably save them as CSV or parquet file for further processing. Each JSON file represents one row in the final DataFrame.
import os
import glob

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

HDFS_MOUNT = # ...
DATA_SET_BASE = # ...

schema = StructType([
    StructField("documentId", StringType(), True),
    StructField("group", StringType(), True),
    StructField("text", StringType(), True)
])

# Get the file paths
file_paths = glob.glob(os.path.join(HDFS_MOUNT, DATA_SET_BASE, '**/*.json'))
file_paths = [f.replace(HDFS_MOUNT + '/', '') for f in file_paths]
print('Found {:d} files'.format(len(file_paths)))  # 676 files

sql = SQLContext(sc)
df = sql.read.json(file_paths, schema=schema)
print('Loaded {:d} rows'.format(df.count())) # 9660 rows (what !?)
Besides the fact that there are 9660 rows instead of 676 (number of available files) I also have the problem that the content seems to be None:
df.head(2)[0].asDict()
gives
{
    'documentId': None,
    'group': None,
    'text': None,
}
Example Data
This is just fake data of course but it resembles the actual data.
Note: Some fields may be missing, e.g. text is not always present.
a.json
{
    "documentId": "001",
    "group": "A",
    "category": "indexed_document",
    "linkIDs": ["adiojer", "asdi555", "1337"]
}
b.json
{
    "documentId": "002",
    "group": "B",
    "category": "indexed_document",
    "linkIDs": ["linkId", "1000"],
    "text": "This is the text of this document"
}
Assuming that all your files have the same structure and are in the same directory:
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json')
There might be a problem if any of the columns has null values for all rows; then Spark will not be able to determine the schema, so you have the option to tell Spark which schema to use:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, LongType

sc = SparkContext(appName="My app")
sql_cntx = SQLContext(sc)

schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", LongType(), True)
])

df = sql_cntx.read.json('/hdfs/path/to/folder/*.json', schema=schema)
UPD:
In case a file contains multi-line (pretty-printed) JSON, you can try this code:
sc = SparkContext(appName='Test')
sql_context = SQLContext(sc)
rdd = sc.wholeTextFiles('/tmp/test/*.json').values()
df = sql_context.read.json(rdd, schema=schema)
df.show()
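Adapted to the question's schema, the wholeTextFiles approach would look roughly like the sketch below (it assumes an existing SparkContext named sc and the documentId/group/text schema from the question). Because each file becomes one record, the result should have one row per JSON file:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sql_context = SQLContext(sc)

schema = StructType([
    StructField("documentId", StringType(), True),
    StructField("group", StringType(), True),
    StructField("text", StringType(), True),
])

# wholeTextFiles yields (path, content) pairs, so multi-line JSON files
# stay intact instead of being split line by line.
rdd = sc.wholeTextFiles('/hdfs/path/to/folder/*.json').values()
df = sql_context.read.json(rdd, schema=schema)
df.show()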

pandas series sort_index() not working with kind='mergesort'

I needed a stable index sorting for DataFrames, when I had this problem:
In cases where a DataFrame becomes a Series (when only a single column matches the selection), the kind argument returns an error. See example:
import pandas as pd
df_a = pd.Series(range(10))
df_b = pd.Series(range(100, 110))
df = pd.concat([df_a, df_b])
df.sort_index(kind='mergesort')
with the following error:
----> 6 df.sort_index(kind='mergesort')
TypeError: sort_index() got an unexpected keyword argument 'kind'
With DataFrames (more than one column selected), mergesort works OK.
EDIT:
When selecting a single column from a DataFrame for example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame(np.array(range(25)).reshape(5,5))
df_b = pd.DataFrame(np.array(range(100, 125)).reshape(5,5))
df = pd.concat([df_a, df_b])
the following returns an error:
df[0].sort_index(kind='mergesort')
...since the selection is cast to a pandas Series, and, as pointed out, the pandas.Series.sort_index documentation contains a bug.
However,
df[[0]].sort_index(kind='mergesort')
works alright, since its type continues to be a DataFrame.
pandas.Series.sort_index() has no kind parameter.
Here is the definition of this function for pandas 0.18.1 (file: ./pandas/core/series.py):
# line 1729
@Appender(generic._shared_docs['sort_index'] % _shared_doc_kwargs)
def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
               sort_remaining=True):
    axis = self._get_axis_number(axis)
    index = self.index

    if level is not None:
        new_index, indexer = index.sortlevel(level, ascending=ascending,
                                             sort_remaining=sort_remaining)
    elif isinstance(index, MultiIndex):
        from pandas.core.groupby import _lexsort_indexer
        indexer = _lexsort_indexer(index.labels, orders=ascending)
        indexer = com._ensure_platform_int(indexer)
        new_index = index.take(indexer)
    else:
        new_index, indexer = index.sort_values(return_indexer=True,
                                               ascending=ascending)

    new_values = self._values.take(indexer)
    result = self._constructor(new_values, index=new_index)

    if inplace:
        self._update_inplace(result)
    else:
        return result.__finalize__(self)
File ./pandas/core/generic.py, line 39:

_shared_doc_kwargs = dict(axes='keywords for axes', klass='NDFrame',
                          axes_single_arg='int or labels for object',
                          args_transpose='axes to permute (int or label for'
                                         ' object)')
So most probably it's a bug in the pandas documentation...
Your df is a Series, not a DataFrame.
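As a workaround on pandas versions where Series.sort_index() lacks the kind argument, one option (a sketch, not from the original answers) is to route through a one-column DataFrame, which the question already shows accepts kind='mergesort', or to build a stable ordering with numpy:

import numpy as np
import pandas as pd

s = pd.concat([pd.Series(range(10)), pd.Series(range(100, 110))])

# Option 1: sort via a one-column DataFrame, then take the column back out.
stable_1 = s.to_frame('value').sort_index(kind='mergesort')['value']

# Option 2: compute a stable index ordering with numpy's mergesort.
order = np.argsort(s.index.values, kind='mergesort')
stable_2 = s.iloc[order]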
