PyArrow schema with timestamp unit 's' changes to 'ms' when written to Parquet and reloaded - parquet

As seen below, the "dob" field was of type timestamp[s] when written to Parquet format with pq.write_metadata, but upon rereading the metadata the type changed to timestamp[ms]:
Python 3.11.1 (main, Jan 26 2023, 10:38:20) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa, pyarrow.parquet as pq
>>> schema = pa.schema([ pa.field("dob", pa.timestamp('s')) ])
>>> schema
dob: timestamp[s]
>>> pq.write_metadata(schema, '_common_schema')
>>> reloaded_schema = pq.read_schema('_common_schema')
>>> reloaded_schema
dob: timestamp[ms]
>>>
Is this because the Parquet format does not support timestamps of unit second?
How can I make the schema exactly the same in this case?

The Parquet format does not support timestamps with second resolution: its timestamp logical type only allows millisecond, microsecond, or nanosecond precision. When you write a PyArrow schema containing a timestamp[s] field, PyArrow therefore coerces the unit to the nearest supported one, milliseconds, and that is what gets stored in the file. When you read the metadata back, the stored millisecond unit is what you get, so the reloaded schema shows timestamp[ms]. To avoid the mismatch, declare the field with a unit Parquet supports (for example 'ms' or 'us') so that the schema you write is exactly the schema you read back.
You can use:
import pyarrow as pa
import pyarrow.parquet as pq
# Specify the Timestamp unit as milliseconds
schema = pa.schema([ pa.field("dob", pa.timestamp('ms')) ])
# Write the schema to a Parquet metadata file
pq.write_metadata(schema, '_common_schema')
# Read the schema back from the metadata file
reloaded_schema = pq.read_schema('_common_schema')
# The reloaded schema should now show the Timestamp field as having a unit of milliseconds
print(reloaded_schema)
This gives the expected behavior: the timestamp field is represented with a unit of milliseconds both when the schema is written to the Parquet metadata file and when it is reloaded from it.
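If you actually need second resolution in memory, a minimal sketch of a workaround (assuming every timestamp[ms] field in the reloaded schema was originally declared in seconds) is to rebuild the schema after reading:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field("dob", pa.timestamp('s'))])
pq.write_metadata(schema, '_common_schema')

reloaded_schema = pq.read_schema('_common_schema')   # dob: timestamp[ms]

# Rebuild the schema, restoring second resolution on the coerced timestamp fields.
# This assumes every timestamp[ms] field was originally declared as timestamp[s].
restored_schema = pa.schema([
    pa.field(f.name, pa.timestamp('s'), nullable=f.nullable)
    if pa.types.is_timestamp(f.type) and f.type.unit == 'ms'
    else f
    for f in reloaded_schema
])
print(restored_schema)   # dob: timestamp[s]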
There are a few other data types that can be represented differently in Arrow and Parquet. Here are some to be aware of:
Decimal: Both Arrow and Parquet support decimals with an explicit precision and scale, but the physical storage differs: Parquet stores decimals as 32-bit integers, 64-bit integers, or fixed-length byte arrays depending on the precision. Values round-trip as long as the precision and scale are within the ranges both sides support.
Timestamp: As we have seen, Parquet's timestamp type only supports millisecond, microsecond, and nanosecond precision, so an Arrow timestamp with second resolution is coerced to milliseconds on write. Declare a unit Parquet supports if the schema needs to round-trip exactly.
Time: Similarly, Parquet's time type only supports millisecond, microsecond, and nanosecond precision, so an Arrow time32 field with second resolution is coerced in the same way.
Nested structures: Both Arrow and Parquet support nested types such as lists, structs, and maps, and PyArrow converts between the two representations automatically. However, not every Arrow type has a Parquet equivalent; union types, for example, cannot be written to Parquet, and dictionary-encoded columns may come back as their plain value type.
These are some of the main differences to be aware of when converting between the Arrow and Parquet formats. It's important to ensure that the data is correctly represented in both to avoid unexpected behavior and data loss.
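As a quick check on which of these types survive a schema round-trip, here is a sketch that reuses the write_metadata/read_schema calls from the question ('_types_demo_schema' is just an arbitrary file name; the expected output follows the coercion behaviour shown above):

import pyarrow as pa
import pyarrow.parquet as pq

# A schema mixing the types discussed above: a second-resolution timestamp,
# a decimal with explicit precision/scale, and a nested list.
schema = pa.schema([
    pa.field("dob", pa.timestamp('s')),
    pa.field("price", pa.decimal128(10, 2)),
    pa.field("tags", pa.list_(pa.int64())),
])

pq.write_metadata(schema, '_types_demo_schema')
print(pq.read_schema('_types_demo_schema'))
# Expected: dob comes back as timestamp[ms] (the coercion discussed above),
# while the decimal and the nested list types round-trip unchanged.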

Related

Azure Data Factory Converting Source Data Type to a Different Format

I am using Azure Data Factory to copy data from an Oracle Database to ADLS Gen 2 Container
In the COPY Activity, I added Source as Oracle DB and Sink as ADLS
I want to create Parquet file in Sink
When I click on Mapping, I can see that the NUMBER data type in the Source is getting converted to Double in ADF.
Also, the Date type in the source is converted to DateTime in ADF.
Due to this, I am not able to load the correct data.
I even tried typecasting in the source query to convert it into the same format as the source, but ADF is still converting it into Double.
Please find the below screenshot as a reference:
Here the ID column is NUMBER in the Oracle DB, but ADF is treating it as Double and adding .0 to the data, which is not what I need.
Even after typecasting it to NUMBER it is not showing the correct type.
What can be the possible root cause of this issue, and why is the Source data type not shown in the correct format?
Due to this, the Parquet file which I am creating is not correct, and my Synapse table (the end destination) is not able to load the data, as in Synapse I have kept the ID column as Int.
Ideally, ADF should show the same data type as in the Source.
Please let me know if you have any solutions or suggestions for me to try.
Thanks!
I am not an Oracle user, but as I understand it the NUMBER data type is generic and can be either integer or decimal based. Parquet does not have this concept, so when ADF converts the value it basically has to use a decimal type (such as Double) to prevent loss of data. If you really want the data to be an Integer, then you'll need to use a Data Flow (instead of the COPY activity) to cast the incoming values to an integer column.
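If the Parquet files end up being post-processed with PyArrow at some point, another (purely illustrative) option is to cast the Double column back to an integer type after the copy; the file and column names below are assumptions, and the cast is only safe if the values are whole numbers:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Hypothetical output of the COPY activity.
table = pq.read_table('copied_from_adf.parquet')

# Cast the ID column from double back to int64. The default "safe" cast
# raises if a value cannot be represented exactly as an integer.
idx = table.schema.get_field_index('ID')
table = table.set_column(idx, 'ID', pc.cast(table.column('ID'), pa.int64()))

pq.write_table(table, 'copied_from_adf_int.parquet')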

ADF force format stored in parquet from copy activity

I've created an ADF pipeline that converts a delimited file to parquet in our data lake. I've added an additional column and set the value using the following expression: #convertfromutc(utcnow(),'GMT Standard Time','o'). The problem I am having is that when I look at the parquet file, the value is coming back in the US format,
e.g. 11/25/2021 14:25:49.
Even if I use #if(pipeline().parameters.LoadDate,json(concat('[{"name": "LoadDate" , "value": "',formatDateTime(convertfromutc(utcnow(),'GMT Standard Time','o')),'"}]')),NULL) to try to force the format on the extra column, it still comes back in the parquet file in the US format.
Any idea why this would be and how I can get this to output into parquet as a proper timestamp?
Mention the format pattern while using convertFromUtc function as shown below.
#convertFromUtc(utcnow(),'GMT Standard Time','yyyy-MM-dd HH:mm:ss')
Added a date1 column in Additional columns under the source to get the required date format.
Preview of the source data in mappings: here the data is previewed in the format given in the convertFromUtc function.
Output parquet file: data preview of the sink parquet file after copying the data from the source.
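If the column ultimately needs to be a real Parquet timestamp rather than a formatted string, one downstream option (a sketch outside ADF; the file name and the LoadDate column name are assumptions) is to parse the string with PyArrow after the copy:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Hypothetical output of the copy activity.
table = pq.read_table('copied.parquet')

# Parse the 'yyyy-MM-dd HH:mm:ss' strings written by convertFromUtc into a
# real timestamp column. Note that a second-resolution timestamp will be
# stored as milliseconds in Parquet, as discussed at the top of the page.
idx = table.schema.get_field_index('LoadDate')
parsed = pc.strptime(table.column('LoadDate'), format='%Y-%m-%d %H:%M:%S', unit='s')
table = table.set_column(idx, 'LoadDate', parsed)

pq.write_table(table, 'with_timestamp.parquet')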

Issue with timestamp field while using SQOOP

An extra space is added before the milliseconds while the timestamp field is being ingested; for example, 05-OCT-17 03.39.02.689000000 AM is ingested as
2017-10-5 3:39:2. 689000000. I am using Oracle as the source and the Parquet format for storing the data in HDFS.
Any suggestions on how this can be avoided?

Hadoop Input Formats - Usage

I want to know about the different file formats in Hadoop. By default Hadoop uses the text input format. What are the advantages/disadvantages of using the text input format?
What are the advantages/disadvantages of Avro over the text input format?
Also, please help me understand the use cases for the different file formats (Avro, Sequence, TextInput, RCFile).
I believe there are no advantages of Text as the default other than its contents being human readable and friendly. You could easily view the contents by issuing hadoop fs -cat.
The disadvantages with the Text format are:
It takes more resources on disk, so it would impact production job efficiency.
Writing/parsing the text records takes more time.
There is no option to maintain data types in case the text is composed from multiple columns.
The Sequence, Avro, and RCFile formats have very significant advantages over the Text format.
Sequence - The key/value objects are stored directly in binary format through Hadoop's native serialization process by implementing the Writable interface. The data types of the columns are very well maintained, and parsing the records with the relevant data types is done easily. Obviously it takes less space compared with Text due to the binary format.
Avro - It's a very compact binary storage format for Hadoop key/value pairs that reads/writes records through Avro serialization/deserialization. It is very similar to the Sequence file format but also provides language interoperability and cell versioning.
You may choose Avro over Sequence only if you need cell versioning or if the data to be stored will be used by other applications written in languages other than Java. Avro files can be processed in many languages such as C, Ruby, Python, PHP, and Java, whereas Sequence files are specific to Java.
RCFile - The Record Columnar File format is column oriented; it is a Hive-specific storage format designed to let Hive load data faster and reduce storage space.
Apart from this you may also consider the ORC and the Parquet file formats.

hive hbase integration timestamp

I would like to store a table into HBase using Hive (Hive HBase integration).
My table contains a field typed TIMESTAMP (like DATE).
I've done some research and I discovered that TIMESTAMP is not supported by HBase, so what should I do? This is the error I get:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating dat
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:80)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:529)
... 9 more
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:185)
at org.apache.hadoop.hive.serde2.lazy.LazyTimestamp.init(LazyTimestamp.java:74)
at org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:219)
at org.apache.hadoop.hive.serde2.lazy.LazyStruct.getField(LazyStruct.java:192)
at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:188)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.evaluate(ExprNodeColumnEvaluator.java:98)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:76)
The easiest thing to do would be to convert the TIMESTAMP into a STRING, INT, or FLOAT. This will have the unfortunate side effect of giving up Hive's built-in TIMESTAMP support. Due to this you will lose:
Read time checks to make sure your column contains a valid TIMESTAMP
The ability to transparently use TIMESTAMPs of different formats
The use of Hive UDFs which operate on TIMESTAMPs.
The first two losses are mitigated if you choose a single format for your own timestamps and stick to it. The last is not a huge loss because only two Hive date functions actually operate on TIMESTAMPs; most of them operate on STRINGs. If you absolutely needed from_utc_timestamp and to_utc_timestamp, you could write your own UDF.
If you go with STRING and only need the date, I would go with a yyyy-mm-dd format. If you need the time as well, go with yyyy-mm-dd hh:mm:ss, or yyyy-mm-dd hh:mm:ss[.fffffffff] if you need partial-second timestamps. This format is also consistent with how Hive expects TIMESTAMPs and is the form required for most Hive date functions.
If you go with INT you again have a couple of options. If only the date is important, YYYYMMDD fits the "basic" format of ISO 8601 (this is a form I've personally used and found convenient when I didn't need to perform any date operations on the column). If the time is also important, go with YYYYMMDDhhmmss, which is an acceptable variant of the basic ISO 8601 form for date and time. If you need fractional-second timing, then use a FLOAT and the form YYYYMMDDhhmmss.fffffffff. Note that neither of these forms is consistent with how Hive expects integer or floating-point TIMESTAMPs.
If the concept of calendar dates and time of day isn't important at all, then using an INT as a Unix timestamp is probably the easiest, or a FLOAT if you also need fractional seconds. This form is consistent with how Hive expects TIMESTAMPs.
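For reference, here is a small sketch (in Python rather than HiveQL) of the representations discussed above, applied to a single instant:

from datetime import datetime, timezone

t = datetime(2017, 10, 5, 3, 39, 2, tzinfo=timezone.utc)

# STRING in Hive's expected 'yyyy-MM-dd HH:mm:ss' form
as_string = t.strftime('%Y-%m-%d %H:%M:%S')        # '2017-10-05 03:39:02'

# INT in the ISO 8601 "basic" YYYYMMDDhhmmss form
as_basic_int = int(t.strftime('%Y%m%d%H%M%S'))     # 20171005033902

# INT as a Unix timestamp (seconds since the epoch)
as_unix = int(t.timestamp())                       # 1507174742

print(as_string, as_basic_int, as_unix)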
