ADF pipeline not able to read DECIMAL(36,0) value from Parquet file

We're using a copy activity to copy parquet file data into our managed instance SQL server.
The source is using a SQL Serverless query to read the parquet files.
There's a new column coming through that brings in large values and causes failures, e.g. 28557632721941551956925858310928928
There isn't any problem querying it straight out of Azure Data Studio using SQL Serverless.
Here's the error message:
{
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to read data from source.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.OverflowException,Message=Conversion overflows.,Source=System.Data,'",
"failureType": "UserError",
"target": "Stage Parquet File Data",
"details": []
}
I also tried using a parquet file dataset for my source. This is the failure I received:
{
"errorCode": "2200",
"message": "ErrorCode=ParquetBridgeInvalidData,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column gwcbi___seqval of primitive type FixedLenByteArray, original type Decimal contained an invalid value for the given original type.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,'",
"failureType": "UserError",
"target": "Stage Parquet File Data",
"details": []
}
This looks like a serious limitation of Synapse/ADF pipelines. Any ideas?
Thanks,
Jason

A conversion overflow means the value was too large for the data type it was being stored in. Decimals with a precision greater than 28 (BigDecimal values) are not supported in the ADF copy activity, which is why you are hitting this error.
As a workaround, you can cast/convert the column to another data type (for example, string/varchar) in the source query.
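For illustration only, here is a minimal sketch of that workaround in the serverless SQL source query; the storage URL is a placeholder and gwcbi___seqval is the DECIMAL(36,0) column named in the second error message:
SELECT
    CAST(gwcbi___seqval AS VARCHAR(40)) AS gwcbi___seqval  -- 36 digits plus a sign fit comfortably
    -- , remaining columns passed through unchanged
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/<path>/*.parquet',
    FORMAT = 'PARQUET'
) AS src;
The staging column on the managed instance side can then be VARCHAR(40), or DECIMAL(38,0) if you cast it back after the copy; the precision limit here is in the copy activity, not in SQL Server.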
If you have feedback to improve the ADF product, please feel free to log it in the ADF Ideas forum here - https://feedback.azure.com/d365community/forum/1219ec2d-6c26-ec11-b6e6-000d3a4f032c

Related

Azure Data Factory Converting Source Data Type to a Different Format

I am using Azure Data Factory to copy data from an Oracle database to an ADLS Gen 2 container.
In the COPY activity, I added the Oracle DB as the source and ADLS as the sink.
I want to create a Parquet file in the sink.
When I click on Mapping, I can see that the NUMBER data type in the source is being converted to Double in ADF.
Also, the Date type in the source is converted to DateTime in ADF.
Because of this I am not able to load the data correctly.
I even tried typecasting in the source query to keep the original type, but ADF still converts it to Double.
Please find the screenshot below as a reference:
Here the ID column is NUMBER in the Oracle DB, but ADF treats it as Double and adds .0 to the data, which is not what I need.
Even after typecasting it to NUMBER it still does not show the correct type.
What could be the root cause of this issue, and why is the source data type not shown in the correct format?
Because of this, the Parquet file I am creating is not correct, and my Synapse table (the end destination) cannot load the data, since in Synapse I have defined the ID column as Int.
Ideally, ADF should show the same data type as the source.
Please let me know if you have any solutions or suggestions for me to try.
Thanks!
I am not an Oracle user, but as I understand it the NUMBER data type is generic and can hold either integer or decimal values. Parquet does not have this concept, so when ADF converts it, it basically has to use a decimal type (such as Double) to prevent loss of data. If you really want the data to be an integer, you'll need to use a Data Flow (instead of COPY) to cast the incoming values to an integer column.

ADF force format stored in parquet from copy activity

I've created an ADF pipeline that converts a delimited file to parquet in our data lake. I've added an additional column and set its value using the following expression: #convertfromutc(utcnow(),'GMT Standard Time','o'). The problem I am having is that when I look at the parquet file, the value comes back in the US format.
e.g. 11/25/2021 14:25:49
Even if I use #if(pipeline().parameters.LoadDate,json(concat('[{"name": "LoadDate" , "value": "',formatDateTime(convertfromutc(utcnow(),'GMT Standard Time','o')),'"}]')),NULL) to try to force the format on the extra column, it still comes back in the parquet file in the US format.
Any idea why this would be and how I can get this to output into parquet as a proper timestamp?
Specify the format pattern when using the convertFromUtc function, as shown below.
#convertFromUtc(utcnow(),'GMT Standard Time','yyyy-MM-dd HH:mm:ss')
A date1 column was added under Additional columns in the source to get the required date format.
Preview of the source data in Mappings: the data is previewed in the format given in the convertFromUtc function.
Output parquet file:
Data preview of the sink parquet file after copying data from the source.

Flume Hive sink failed to serialize JSON with array

I am trying to load JSON data to Hive via Hive Sink.
But it fails with the following error:
WARN org.apache.hive.hcatalog.data.JsonSerDe: Error [java.io.IOException: Field name expected] parsing json text [{"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}].
INFO org.apache.flume.sink.hive.HiveWriter: Parse failed : Unable to convert byte[] record into Object : {"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}
Example of data:
{"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}
Description of Hive table:
id string
url string
title array<string>
time string
# Partitions
time string
The same setup works fine if the JSON data doesn't contain arrays (and the Hive table doesn't either).
Flume version: 1.7.0 (Cloudera CDH 5.10)
Is it possible to load JSON data with arrays via the Flume Hive sink?
Is it possible to load JSON data with arrays via Flume Hive sink?
I assume it is possible, though I have never tried it myself. From:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_HDP_RelNotes/content/ch01s08s02.html
Following serializers are provided for Hive sink:
JSON: Handles UTF8 encoded Json (strict syntax) events and requires no configuration. Object names in the JSON are mapped directly to columns with the same name in the Hive table. Internally uses org.apache.hive.hcatalog.data.JsonSerDe but is independent of the Serde of the Hive table. This serializer requires HCatalog to be installed.
So maybe something is wrong in how you are configuring the SerDe. This user solved the problem of serialising JSON with arrays by applying a regexp first:
Parse json arrays using HIVE
Another thing you may try is changing the SerDe. You have at least these two options (maybe there are more):
'org.apache.hive.hcatalog.data.JsonSerDe'
'org.openx.data.jsonserde.JsonSerDe'
(https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master)
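Purely to illustrate the second option, here is a minimal HiveQL sketch of declaring a table with the OpenX SerDe and the array<string> column from the question; the table name and TEXTFILE storage are assumptions, and the Flume Hive sink has its own requirements for the target table, so treat this only as a pointer to where the SerDe is declared:
-- Hypothetical table definition, not the poster's actual setup.
CREATE TABLE events_json (
  id    string,
  url   string,
  title array<string>
)
PARTITIONED BY (time string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;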

Apache Drill memory exception

I am trying to reformat over 600 GB of CSV files into Parquet using Apache Drill in a single-node setup.
I run my SQL statement:
CREATE TABLE Data_Transform.`/` AS
....
FROM Data_source.`/data_dump/*`
and it is creating parquet files but I get the error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR:
One or more nodes ran out of memory while executing the query.
Is there a way around this?
Or is there an alternative way to do the conversion?
I don't know if querying all those GB on a local node is feasible. If you've configured the memory per the docs, using a cluster of Drillbits to share the load is the obvious solution, but I guess you already know that.
If you're willing to experiment, and you're converting the CSV files with a select * rather than selecting individual columns, change the query to something like select columns[0] as user_id, columns[1] as user_name, and cast columns to types like int, float, or datetime where possible (see the sketch below). This avoids the overhead of carrying everything as varchar and prepares the data for the future queries that would otherwise need those casts at analysis time.
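A rough sketch of what that per-column CTAS could look like; the column positions, names, target path, and timestamp format are illustrative assumptions rather than details from the original query:
-- Hypothetical Drill CTAS: cast individual CSV columns instead of SELECT *,
-- so the Parquet output gets typed columns rather than all-VARCHAR ones.
CREATE TABLE Data_Transform.`/converted` AS
SELECT
    CAST(columns[0] AS INT)                         AS user_id,
    columns[1]                                      AS user_name,
    CAST(columns[2] AS FLOAT)                       AS score,
    TO_TIMESTAMP(columns[3], 'yyyy-MM-dd HH:mm:ss') AS created_at
FROM Data_source.`/data_dump/*`;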
I've also seen the following recommendation from a Drill developer: split the files into smaller files manually to get around the limitations of the local file system, since Drill doesn't split files on block boundaries.

Building a search layer index on avro serialized data

I have my Avro-serialized data on HDFS. Now I'm trying to build a search interface where I can query the Avro data and fetch the results. I can use the following approach, but it has some disadvantages:
Deserialize the Avro data, load it into Hive, build an indexing layer using Solr/Lucene, and run the queries.
What if the avro schema has multiple layers, like
{
  "name": "xyz",
  "height": "180cm",
  "Cities_residing": ["X", "Y", "Z"],
  "Hotels_checkedin": ["X", "Y", "Z"],
  "itemX": {
    "itemY": {
      "itemZ": "546"
    }
  }
}
Now, storing the above hierarchical data record will be difficult. Also, I don't want to replicate the data by deserializing the Avro records and storing them in some document store; that introduces a lot of duplication.
So I'm looking for a search tool over Avro-serialized data (with multiple levels of hierarchy).
If existing tools already solve this problem, please point me to them.
If you're working in Java, SortedKeyValueFile may be an alternative worth exploring. At this time, I am not aware of a similar implementation in python or C/C++. This is obviously not as generic as BigQuery; however, it may solve use cases where you only need to query by key within a file.
The big cloud providers now have solutions for searching through Avro files.
AWS Athena and BigQuery are two examples of services that might solve your issue, especially if you are willing to switch from HDFS to S3 or a similar service.
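To give a sense of what that looks like in practice, here is a hedged Athena sketch over Avro files copied to S3; the bucket, table name, and the simplified lower-cased schema are all assumptions, and in reality the avro.schema.literal has to match the schema the files were written with:
-- Hypothetical external table over Avro data in S3.
CREATE EXTERNAL TABLE people_avro (
  name            string,
  cities_residing array<string>,
  itemx           struct<itemy:struct<itemz:string>>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal' = '
{
  "type": "record",
  "name": "person",
  "fields": [
    {"name": "name",            "type": "string"},
    {"name": "cities_residing", "type": {"type": "array", "items": "string"}},
    {"name": "itemx",           "type": {"type": "record", "name": "itemx_rec",
      "fields": [{"name": "itemy", "type": {"type": "record", "name": "itemy_rec",
        "fields": [{"name": "itemz", "type": "string"}]}}]}}
  ]
}
')
STORED AS INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT           'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my-bucket/avro-data/';
-- Nested fields and arrays are then queryable directly:
SELECT name, itemx.itemy.itemz
FROM people_avro
WHERE contains(cities_residing, 'X');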
