How to fetch metadata from a Parquet file using pyarrow

The goal is to fetch the schema details of a Parquet file using pyarrow. The column names can be fetched with the code below, but I am unable to fetch the other column metadata (data type, nullability).
import pyarrow.parquet as pq

dataset = pq.ParquetDataset("path/to/parquet")  # instantiate against the dataset path
schema = dataset.schema
column_count = {"column_count": len(schema.names)}
for index, columnname in enumerate(schema.names, 1):
    column_id = str(index)
    column_name = str(columnname)
The code is in Python. When the schema is printed, the required fields and data types are visible:
<pyarrow._parquet.ParquetSchema object at 0x0000015FFFA1B730>
required group field_id=0 org.apache.avro.file.Header {
  optional int64 field_id=1 XXXX;
  optional binary field_id=2 XXXX (String);
  optional int64 field_id=3 XXXX;
  optional binary field_id=4 XXXX (String);
  optional binary field_id=5 XXXX (String);
  optional binary field_id=6 XXXX (String);
  optional binary field_id=7 XXXX (String);
  optional binary field_id=8 XXXX (String);
  optional binary field_id=9 XXXX (String);
  optional binary field_id=10 XXXX (String);
}
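The data type and nullability of each column are easiest to read from the Arrow-level schema. A minimal sketch, assuming a hypothetical local file "data.parquet"; pq.read_schema() returns a pyarrow.Schema whose fields expose a type and a nullable flag:

import pyarrow.parquet as pq

schema = pq.read_schema("data.parquet")  # hypothetical path
for index, field in enumerate(schema, 1):
    print({
        "column_id": str(index),
        "column_name": field.name,
        "data_type": str(field.type),  # e.g. "int64", "string"
        "nullable": field.nullable,    # False for "required" Parquet fields
    })

The same schema is also available as pq.ParquetFile("data.parquet").schema_arrow if a ParquetFile handle is already open.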

Related

How should a protobuf message with repeated fields be converted to parquet to be queried by Athena?

We write parquet files to S3 and then use Athena to query that data. We use the "parquet-protobuf" library to convert proto messages into parquet records. We recently added a repeated field to our proto message definition and expected to be able to fetch/query it using Athena.
message LedgerRecord {
  ....
  repeated uint64 shared_account_ids = 168; // newly added field
}
Resulting parquet schema, as seen by the parquet-tools inspect command:
############ Column(shared_account_ids) ############
name: shared_account_ids
path: shared_account_ids
max_definition_level: 1
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
Athena table alteration:
ALTER TABLE order_transactions_with_projections ADD COLUMNS (shared_account_ids array<bigint>);
Simple query that fails:
WITH T1 AS (
  SELECT DISTINCT record_id, msg_type, rec, time_sent, shared_account_ids
  FROM ledger_int_dev.order_transactions_with_projections
  WHERE environment = 'int-dev-cert'
    AND (epoch_day >= 19082 AND epoch_day <= 19172)
    AND company = '12'
)
SELECT * FROM T1
ORDER BY time_sent DESC
LIMIT 100
Error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<file_name>.snappy.parquet (offset=0, length=48634): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
How do I convert a protobuf message with a "repeated" field into parquet so that Athena understands the field as an ARRAY instead of a primitive type? Is an intermediate conversion to Avro necessary, as mentioned in https://github.com/rdblue/parquet-avro-protobuf/blob/master/README.md, or can the parquet-mr library be used directly?
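For reference, the error indicates the schema shape Athena rejects: it expects an array column stored as a LIST-annotated group, while this file stores shared_account_ids as a bare repeated primitive. A minimal sketch (pyarrow, with a hypothetical file name) of what the LIST-annotated shape looks like when an array column is written the standard way:

import pyarrow as pa
import pyarrow.parquet as pq

# Write an array<bigint>-style column the standard (LIST-annotated) way.
table = pa.table({
    "shared_account_ids": pa.array([[1, 2], [3]], type=pa.list_(pa.int64())),
})
pq.write_table(table, "list_example.parquet")  # hypothetical file name

# The Parquet-level schema now shows a group annotated as LIST rather than a
# bare "repeated int64" primitive, which is the shape Athena reads as an array.
print(pq.ParquetFile("list_example.parquet").schema)

On the Java side, newer parquet-protobuf versions have a spec-compliant write mode (ProtoWriteSupport.setWriteSpecsCompliant) that writes repeated fields as LIST groups; whether that removes the need for the Avro detour depends on the library version in use.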

Date conversion in SAS (String to Date)

I import an Excel spreadsheet using the following SAS procedure.
%let datafile = 'Excel.xlsx';
%let tablename = myDB;

proc import datafile = &datafile
    out = &tablename
    dbms = xlsx
    replace;
run;
One of the variables (date_variable) has the format DD.MM.YYYY. Therefore, I defined a new format like this:
data &tablename;
    set &tablename;
    format date_variable mmddyy10.;
run;
Now, I would like to sort the table by that variable:
proc sort data = &tablename;
    by date_variable;
run;
However, as date_variable is defined as a string, I don't get the sorting right. How can I redefine date_variable as a date?
Use the input function to convert the string value containing a date representation into a date value. A date-style format can then be applied to that date value to control how it is rendered in output and viewers.
The input function requires an informat as an argument. Check the informat documentation, which includes an entry for:
DDMMYYw. Informat
Reads date values in the form ddmmyy or dd-mm-yy, where a special character, such as a hyphen (-), period (.), or slash (/), separates the day, month, and year; the year can be either 2 or 4 digits.
Example:
* string values contain date representations;
data have;
    input date_from_excel $10.;
    cards;
31.01.2019
01.02.2019
15.01.2019
;
run;
data want;
    set have;
    date = input(date_from_excel, ddmmyy10.); * convert to SAS date values;
    format date ddmmyyd10.; * format to apply when rendering those values;
run;

* sort by SAS date values;
proc sort data=want;
    by date;
run;
format date_variable mmddyy10.; does not convert a string to a date. It just sets a format for displaying that field.
Essentially, I think what you are saying is that date_variable is a string that looks like "31.01.2019". If that is the case, you will first have to convert it into a date value:
date_variable_converted = input(date_variable, DDMMYY10.);
You should now be able to sort using date_variable_converted, which is a SAS date value.

How to convert an array<date> to array<string> in Hive

I would like to convert the array to a string so that ["2016-06-02","2016-06-02"] becomes 2016-06-02|2016-06-02.
Use the concat_ws(string delimiter, array<string>) function to concatenate the array, with whatever delimiter you need (here '|'):
select concat_ws('|', collect_set(date)) from table;
If the date field is not a string, convert it to a string first: concat_ws('|', collect_set(cast(date as string)))
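For intuition, the same cast-then-join operation in plain Python (illustrative only; not Hive):

import datetime

dates = [datetime.date(2016, 6, 2), datetime.date(2016, 6, 2)]
# Cast each element to a string, then concatenate with the delimiter,
# mirroring concat_ws('|', collect_set(cast(date as string))).
print("|".join(str(d) for d in dates))  # prints: 2016-06-02|2016-06-02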

Data skew associated with different data types

Today I read an article about Hive tuning. One paragraph is as follows:
Scenario: the user_id field in the user table is of type int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash operation is applied to int ids, and this will cause all records of the string type id assigned to a reducer.
Solution: convert the numeric type to a string type:
select * from users a
left outer join logs b
on a.user_id = cast(b.user_id as string)
Can anybody give me more explanation of the above? I really cannot understand what the author is describing. Why does "this will cause all records of the string type id assigned to a reducer" happen? Thanks in advance!
For starters, you did not copy and paste / transcribe the original properly. Here is the more likely wording:
this will cause all records of the string type id assigned to a single reducer.
The reason that would happen is that converting the string ids to int without an explicit cast probably turns them into 0. The hashing will then put all of those ids into the same partition, because they all share the same 0 value.
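A minimal sketch (plain Python, purely illustrative; the keys and reducer count are made up) of why collapsing failed conversions to a single value concentrates rows on one reducer:

def to_int_or_default(s):
    # Mimic an implicit string-to-int conversion that falls back to 0.
    try:
        return int(s)
    except ValueError:
        return 0  # every non-numeric id collapses to the same key value

num_reducers = 4  # made-up reducer count
for key in ["101", "abc", "x9", "202", "zz"]:  # made-up join keys
    print(key, "-> reducer", to_int_or_default(key) % num_reducers)
# "abc", "x9", and "zz" all land on reducer 0, so one reducer receives
# every row whose id failed the numeric conversion.

Casting the int side to string instead keeps the original key values distinct, so the hash spreads them across reducers.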

Table column type for storing JSON data from Laravel

I would like to store some JSON data in one of the table's columns.
Eg.
{"option": 1, "status": false}
What should the column type be set as, longtext? And is the storing of the JSON done using attribute casting?
You can use the text or longtext type depending on your application requirements.
The maximum length of a text column is 65,535 (2^16 − 1) bytes = 64 KiB.
The maximum length of a longtext column is 4,294,967,295 (2^32 − 1) bytes = 4 GiB.
The usage of attribute casting is documented here: http://laravel.com/docs/5.1/eloquent-mutators#attribute-casting
