How to read the Arrow Parquet key-value metadata?

When I save a Parquet file in R and in Python (using pyarrow), I get an Arrow schema string saved in the metadata.
How do I read that metadata? Is it Flatbuffer-encoded data? Where is the definition for the schema? It's not listed on the Arrow documentation site.
The metadata is a key-value pair that looks like this:
key: "ARROW:schema"
value: "/////5AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAEAAAAyP///wQAAAABAAAAFAAAABAAGAAIAAYABwAMABAAFAAQAAAAAAABBUAAAAA4AAAAEAAAACgAAAAIAAgAAAAEAAgAAAAMAAAACAAMAAgABwA…
as a result of writing this in R:
df = data.frame(a = factor(c(1, 2)))
arrow::write_parquet(df, "c:/scratch/abc.parquet")

The schema is base64-encoded flatbuffer data. You can read the schema in Python using the following code:
import base64
import pyarrow as pa
import pyarrow.parquet as pq

filename = "c:/scratch/abc.parquet"  # the file written from R above
# Read the Parquet file-level key-value metadata
meta = pq.read_metadata(filename)
# The ARROW:schema value is a base64-encoded Arrow IPC schema message
decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])
schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema))
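As a quick check (a sketch that assumes the file written by the R snippet above), printing the decoded schema should show the factor column as a dictionary-encoded field. Recent pyarrow versions also expose the reconstructed Arrow schema directly via pq.read_schema:
print(schema)
# Expected to show something like (exact names vary by Arrow version):
#   a: dictionary<values=string, indices=int32, ordered=0>

# Convenience shortcut: read the Arrow schema for a Parquet file directly
print(pq.read_schema(filename))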

Related

Errors writing Geopandas Dataframe to Shapefile

I'm writing a geopandas polygon file to an Esri Shapefile. I can't write it directly because I have date fields that I don't want converted to text; I want to keep them as dates.
I've written a custom schema, but how do I handle the geometry column in the custom schema? It's a WKT field.
This is my custom schema (shortened for length):
schema = {
    'geometry': 'MultiPolygon',
    'properties': {
        'oid': 'int',
        'date_anncd': 'date',
        'value_mm': 'float',
        'geometry': ??
    }
}
Change the geometry type to 'Polygon' (MultiPolygon isn't an Esri type) and drop the geometry value in the properties.
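A minimal sketch of what that looks like (the GeoDataFrame below is a made-up stand-in, and the schema keyword is passed through to Fiona, so this applies when the fiona engine is used):
import datetime
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical stand-in for the real data
gdf = gpd.GeoDataFrame(
    {
        "oid": [1],
        "date_anncd": [datetime.date(2020, 1, 1)],
        "value_mm": [1.5],
        "geometry": [Polygon([(0, 0), (1, 0), (1, 1)])],
    },
    geometry="geometry",
    crs="EPSG:4326",
)

schema = {
    "geometry": "Polygon",  # shapefiles have no distinct MultiPolygon type
    "properties": {
        "oid": "int",
        "date_anncd": "date",
        "value_mm": "float",
        # no 'geometry' entry here: the geometry is described only by the top-level key
    },
}

gdf.to_file("abc.shp", driver="ESRI Shapefile", schema=schema)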

HTTP request with Parquet and pyarrow

I would like to use pyarrow to read/query Parquet data from a REST server. At the moment I'm chunking the data, converting it to pandas, dumping to JSON, and streaming the chunks, like this:
p = pq.ParquetDataset('/path/to/data.parquet', filters=filter, use_legacy_dataset=False)
batches = p._dataset.to_batches(filter=p._filter_expression)
(json.dumps(b.to_pandas().values.tolist()) for b in batches)
This is effectively the same work as:
ds = pq.ParquetDataset('/path/to/data.parquet',
                       use_legacy_dataset=False,
                       filters=filters)
df = ds.read().to_pandas()
data = pd.DataFrame(orjson.loads(orjson.dumps(df.values.tolist())))
without the network IO. It's about 50x slower than just reading into pandas directly:
df = ds.read().to_pandas()
Is there a way to serialize the parquet dataset to some binary string that I can send over http and parse on the client side?
You can send your data using the Arrow in-memory columnar (IPC) format. It will be much more efficient and compact than JSON, but bear in mind it will be binary data (which, unlike JSON, is not human readable).
See the docs for a full example.
In your case you want to do something like this:
ds = pq.ParquetDataset('/path/to/data.parquet',
                       use_legacy_dataset=False,
                       filters=filters)
table = ds.read()  # pa.Table

# Write the data:
batches = table.to_batches()
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
for batch in batches:
    writer.write(batch)
writer.close()
buf = sink.getvalue()

# Read the data:
reader = pa.ipc.open_stream(buf)
read_batches = [b for b in reader]
read_table = pa.Table.from_batches(read_batches)
read_table.to_pandas()
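To actually move that buffer over HTTP, one possible sketch (assuming a Flask server and the requests client, neither of which is part of the original answer) is to return the stream bytes as the response body and parse them on the client with the same pa.ipc.open_stream call:
# Server side (hypothetical Flask endpoint)
from flask import Flask, Response
import pyarrow as pa
import pyarrow.parquet as pq

app = Flask(__name__)

@app.route("/data")
def data():
    table = pq.ParquetDataset('/path/to/data.parquet').read()
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return Response(sink.getvalue().to_pybytes(),
                    mimetype="application/vnd.apache.arrow.stream")

# Client side (hypothetical):
# import requests
# resp = requests.get("http://server/data")
# table = pa.ipc.open_stream(resp.content).read_all()
# df = table.to_pandas()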

Maximo MAXINTMSGTRK table: How to extract text from MSGDATA column? (HUGEBLOB)

I'm attempting to extract the text from the MSGDATA column (HUGEBLOB) in the MAXINTMSGTRK table:
I've tried the options outlined here: How to query hugeblob data:
select
    msg.*,
    utl_raw.cast_to_varchar2(dbms_lob.substr(msgdata, 1000, 1)) msgdata_expanded,
    dbms_lob.substr(msgdata, 1000, 1) msgdata_expanded_2
from
    maxintmsgtrk msg
where
    rownum = 1
However, the output is not text:
How can I extract text from MSGDATA column?
It is possible to do this with an automation script, uncompressing the data using the psdi.iface.jms.MessageUtil class.
from psdi.iface.jms import MessageUtil
...
# Read the compressed BLOB and uncompress it with the MIF utility class
msgdata_blob = maxintmsgtrkMbo.getBytes("msgdata")
byteArray = MessageUtil.uncompressMessage(msgdata_blob, maxintmsgtrkMbo.getLong("msglength"))
# Build a string from the uncompressed byte array
msgdata_clob = ""
for symb1 in byteArray:
    msgdata_clob = msgdata_clob + chr(symb1)
It sounds like it's not possible with SQL alone because the value is stored compressed:
Starting in Maximo 7.6, the messages written by the Message Tracking application are stored in the database. They are no longer written as XML files as in previous versions.
Customers have asked how to search and view MSGDATA data from the MAXINTMSGTRK table.
It is not possible to search or retrieve the data in the maxintmsgtrk table in 7.6 using SQL. The BLOB field is stored compressed.
MIF 7.6 Message tracking changes

How to load the Postgres "text" data type into Hive

I have a Postgres table with a text column (detail). I have declared detail as STRING in Hive. It imports successfully when I use Sqoop or Spark. However, I am missing a lot of the data that is in the detail column, and a lot of empty rows are created in the Hive table.
Can anyone help me with this?
For example, the detail column has the data below:
line1 sdhfdsf dsfdsdfdsf dsfs
line2 jbdfv df ffdkjbfd
jbdsjbfds
dsfsdfb dsfds
dfds dsfdsfds dsfdsdskjnfds
sdjfbdsfdsdsfds
Only "line1 sdhfdsf dsfdsdfdsf dsfs " is getting imported into hive table.
I can see empty rows for remaining lines.
Hive cannot handle multi-line values in text file formats. You must load this data into a binary format, such as Avro or Parquet, to retain the newline characters. If you don't need to retain them, you can strip them with Sqoop's --hive-drop-import-delims option.
Here is a solution using Spark:
SparkConf sparkConf = new SparkConf().setAppName("HiveSparkSQL");
SparkContext sc = new SparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(sc);
// Interpret Parquet binary columns as strings
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true");

String url = "jdbc:postgresql://host:5432/dbname?user=**&password=***";
Map<String, String> options = new HashMap<String, String>();
options.put("url", url);
options.put("dbtable", "(select * from abc.table limit 50) as act1");
options.put("driver", "org.postgresql.Driver");

// Read from Postgres over JDBC and write the result as a Parquet-backed Hive table
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.write().format("parquet").mode(SaveMode.Append).saveAsTable("act_parquet");
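For reference, a roughly equivalent PySpark sketch (the JDBC URL, table, and option values are copied from the Java snippet above, not independently verified):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveSparkSQL")
         .enableHiveSupport()
         .config("spark.sql.parquet.binaryAsString", "true")
         .getOrCreate())

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://host:5432/dbname?user=**&password=***")
           .option("dbtable", "(select * from abc.table limit 50) as act1")
           .option("driver", "org.postgresql.Driver")
           .load())

jdbc_df.write.format("parquet").mode("append").saveAsTable("act_parquet")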

Pig LOADER for SPLUNK-like records

I am trying to use PIG to read data from HDFS where the files contain rows that look like:
"key1"="value1", "key2"="value2", "key3"="value3"
"key1"="value10", "key3"="value30"
In a way the rows of the data are essentially dictionaries:
{"key1":"value1", "key2":"value2", "key3":"value3"}
{"key1":"value10", "key3":"value30"}
I can read and dump portion of this data easily enough with something like:
data = LOAD '/hdfslocation/weirdformat*' USING PigStorage(',');
sampled = SAMPLE data 0.00001;
dump sampled;
My problem is that I can't parse it efficiently. I have tried to use
org.apache.pig.piggybank.storage.MyRegExLoader
but it seems extremely slow.
Could someone recommend a different approach?
It seems like one way is to use a Python UDF.
This solution is heavily inspired by bag-to-tuple.
In myudfs.py write:
#!/usr/bin/python

def FieldPairsGenerator(dataline):
    for x in dataline.split(','):
        k, v = x.split('=')
        yield (k.strip().strip('"'), v.strip().strip('"'))

# the outputSchema decorator tells Pig that the UDF returns a map
@outputSchema("foo:map[]")
def KVDataToDict(dataline):
    return dict(kvp for kvp in FieldPairsGenerator(dataline))
then write the following Pig script:
REGISTER 'myudfs.py' USING jython AS myfuncs;
data = LOAD 'whereyourdatais*.gz' AS (foo:chararray);
A = FOREACH data GENERATE myfuncs.KVDataToDict(foo);
A now has the data stored as a Pig map.
