How can I read a parquet file that uses the DELTA_BYTE_ARRAY encoding? - parquet

I am trying to read a parquet file that is quite bleeding edge: it uses the v2 page format and encodes its columns with the DELTA_BYTE_ARRAY encoding. I tried to read this file with the arrow/parquet C++ library, but I only got Parquet error: Not yet implemented: Unsupported encoding. Next, I tried pyarrow.parquet, and that library also throws OSError: Not yet implemented: Unsupported encoding.
So, which open source library is able to read a file that uses this encoding?
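For what it's worth, the failing pyarrow attempt can be reproduced, and the offending encoding confirmed from the file metadata, with a short sketch like this (the file name is a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile('data_v2.parquet')  # placeholder file name
# The metadata can still be read; it lists the encodings each column chunk uses.
print(pf.metadata.row_group(0).column(0).encodings)

try:
    table = pf.read()  # decoding is where the unsupported encoding surfaces
except OSError as exc:
    # Affected pyarrow builds raise "Not yet implemented: Unsupported encoding."
    print(exc)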

Related

Differences between .proto, .pb and .pbtxt

As explained on their website and on Wikipedia, Protocol Buffers (or Protobuf) is "used to serialize structured data". The definition of the data structure is done in a .proto file that can be compiled by protoc and turned into code (.cc/.h, .py, .java, ...) that can be imported in several languages to manipulate and serialize the data.
My understanding is that the .pb files contain that data in binary and the .pbtxt files are an equivalent that contain it in ASCII. Is that correct?
If so, why are .pbtxt files so readable? I've found some with comments (https://github.com/google/mediapipe/blob/master/mediapipe/graphs/hand_tracking/subgraphs/hand_renderer_cpu.pbtxt).
Also, are .pb/.pbtxt enough to interpret the data? Or do you need their .proto?
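A minimal Python sketch of the relationship, assuming a hypothetical addressbook.proto compiled with protoc --python_out=. into addressbook_pb2 (the message and field names are illustrative):

from google.protobuf import text_format
import addressbook_pb2  # hypothetical module generated by protoc

msg = addressbook_pb2.Person(name="Ada", id=1)

# .pb: compact binary wire format, not human readable
with open("person.pb", "wb") as f:
    f.write(msg.SerializeToString())

# .pbtxt: the same data rendered in the human-readable text format
with open("person.pbtxt", "w") as f:
    f.write(text_format.MessageToString(msg))

# Reading either file back requires the generated class, and therefore the .proto:
with open("person.pbtxt") as f:
    parsed = text_format.Parse(f.read(), addressbook_pb2.Person())

This also bears on the last question: the binary form only carries field numbers and the text form only field names, so to interpret the data fully you still need the .proto, or code generated from it.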

Autoload Feature File Format

With the Auto Loader feature, as per the documentation, the configuration cloudFiles.format supports json, csv, text, parquet, binary, and so on. I wanted to know if there is support for XML.
For streaming file data sources, the supported file formats are text, CSV, JSON, ORC, and Parquet. My assumption is that only these streaming file formats are supported.
Not sure if you got a chance to go through https://github.com/databricks/spark-xml; for more complex XML files you can use the spark-xml library. If you want to make use of it, Auto Loader won't help.
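A rough sketch of reading XML with spark-xml outside Auto Loader, assuming the spark-xml package is installed on the cluster (the row tag and path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch read via spark-xml; Auto Loader's cloudFiles source is not involved here.
df = (spark.read
      .format("xml")                      # or "com.databricks.spark.xml"
      .option("rowTag", "record")         # placeholder: the element wrapping one row
      .load("/mnt/raw/events/*.xml"))     # placeholder path

df.printSchema()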

PYSPARK - Reading, Converting and splitting an EBCDIC Mainframe file into a DataFrame

We have an EBCDIC mainframe-format file which is already loaded into the Hadoop HDFS system. The file has the corresponding COBOL structure as well. We have to read this file from HDFS, convert the file data into ASCII format, and split the data into a DataFrame based on its COBOL structure. I've tried some options which didn't seem to work. Could anyone please suggest some proven or working approaches?
For Python, take a look at the Copybook package (https://github.com/zalmane/copybook). It supports most copybook features, including REDEFINES and OCCURS, as well as a wide variety of PIC formats.
pip install copybook
import copybook
root = copybook.parse_file('sample.cbl')
For parsing into a PySpark dataframe, you can use a flattened list of fields and use a UDF to parse based on the offsets:
offset_list = root.to_flat_list()
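A very rough sketch of that UDF idea, assuming the records have already been converted to ASCII fixed-width lines and that a (name, start, length) layout has been extracted from offset_list (the field names, offsets, and path below are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructType, StructField

spark = SparkSession.builder.getOrCreate()

# Placeholder layout derived from the copybook fields: (field name, start, length).
layout = [("CUST_ID", 0, 6), ("CUST_NAME", 6, 20)]

schema = StructType([StructField(name, StringType()) for name, _, _ in layout])

@udf(returnType=schema)
def split_record(line):
    # Slice one ASCII record into its copybook fields by byte offsets.
    return tuple(line[start:start + length].strip() for _, start, length in layout)

raw = spark.read.text("hdfs:///data/mainframe_ascii.txt")  # placeholder path
parsed = raw.select(split_record("value").alias("rec")).select("rec.*")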
Disclaimer: I am the maintainer of https://github.com/zalmane/copybook
Find the COBOL Language Reference manual and research the functions DISPLAY-OF and NATIONAL-OF. See: https://www.ibm.com/support/pages/how-convert-ebcdic-ascii-or-ascii-ebcdic-cobol-program.

Protobuf message - parsing difference between binary and text files

During my implementation of a protocol buffer application, I tried to work with the text .pbtxt files to make my programming easier. The idea was to switch to the .pb binary format afterwards, once I had a clearer understanding of the API. (I am working in C++.)
I made my application work by importing the file with TextFormat::Parse (the content of the file came from TextFormat::Print). I then generated the corresponding binary file, which I tried to import with myMessageVariable.ParseFromCodedStream (the file is not compressed). But I noticed that only a very small part of the message is imported. myMessageVariable.IsInitialized returns true, so I guess the library "thinks" it has completely imported the file.
So my question: is there something different in the way the files are imported that could make the import "half-fail"? (Besides the obvious reason that one is binary and the other one is text?) And what can we do against it?
There are a few differences in reading text data and reading binary data:
Text files sometimes use automatic linefeed conversion (\r\n vs. \n), especially on Windows platforms. This has to be disabled by opening the file in binary mode.
Binary files can contain null bytes at any point. Some text processing functions stop reading at the first null byte.
It could help if you can determine more about how much of the message gets parsed. Then you can look at what kind of bytes are near the problem point, using e.g. a hex editor.
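The same pitfall is easy to demonstrate in Python (the C++ fix is analogous: open the stream with std::ios::binary); my_proto_pb2 and MyMessage below are placeholders for your generated module and class:

import my_proto_pb2  # hypothetical module generated by protoc

# Open in binary mode ("rb"); a text-mode open would decode the bytes and
# translate newlines, corrupting or truncating binary protobuf data.
with open("message.pb", "rb") as f:
    data = f.read()

msg = my_proto_pb2.MyMessage()
msg.ParseFromString(data)

# Compare the file size with how many bytes the parsed message accounts for.
print(len(data), msg.ByteSize())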

MRI images "slideshow"

If I have a series of MRI images in dicom file format, how would I upload them all and view them as a slideshow?
I have a folder which contains a .dat file and 94 .dcm slices. I've tried "load" and got the following message:
>> load('dcmsermap.dat')
Error using load
Number of columns on line 2 of ASCII file
C:\Users\michelle\Desktop\COLAO\dcmsermap.dat
must be the same as previous lines.
I've also tried "imread" and got the following:
>> imread('dcmsermap.dat')
Error using imread (line 382)
Unable to determine the file format.
I did textscan too and got similar error messages.
A good option could be to create a DICOMDIR structure. MATLAB seems to have functions to read a DICOMDIR.
You can easily create the DICOMDIR structure using command-line tools from either the dcmtk toolkit or the dcm4che2 one.
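If a Python route is acceptable instead of MATLAB, a minimal slideshow over the .dcm slices can be sketched with pydicom and matplotlib; the folder path is a placeholder, and this assumes the slices carry an InstanceNumber tag (this is a different approach from the DICOMDIR one above):

import glob
import pydicom
import matplotlib.pyplot as plt

# Read every slice and sort by InstanceNumber so the slideshow plays in order.
slices = [pydicom.dcmread(p) for p in glob.glob("COLAO/*.dcm")]
slices.sort(key=lambda s: int(s.InstanceNumber))

fig, ax = plt.subplots()
for s in slices:
    ax.imshow(s.pixel_array, cmap="gray")
    plt.pause(0.1)  # show each slice for about 0.1 s
    ax.cla()
plt.show()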
