Azure Data Factory Converting Source Data Type to a Different Format - Oracle

I am using Azure Data Factory to copy data from an Oracle database to an ADLS Gen2 container.
In the Copy activity, I added the Oracle DB as the Source and ADLS as the Sink.
I want to create a Parquet file in the Sink.
When I click on Mapping, I can see that the NUMBER data type in the Source is being converted to Double in ADF.
Also, the Date type in the Source is converted to DateTime in ADF.
Because of this I am not able to load the correct data.
I even tried typecasting in the Source query to convert it to the same format as the source, but ADF still converts it to Double.
Please find the screenshot below as a reference:
Here the ID column is NUMBER in the Oracle DB, but ADF treats it as Double and adds .0 to the data, which is not what I need.
Even after typecasting it to NUMBER it does not show the correct type.
What could be the root cause of this issue, and why is the Source data type not shown in the correct format?
Because of this, the Parquet file I am creating is not correct, and my Synapse table (the end destination) cannot load the data, since in Synapse I have defined the ID column as Int.
Ideally, ADF should show the same data type as the Source.
Please let me know if you have any solutions or suggestions for me to try.
Thanks!

I am not an Oracle user, but as I understand it the NUMBER data type is generic and can be either integer- or decimal-based. Parquet does not have this concept, so when ADF converts the data it has to pick a decimal type (such as Double) to prevent loss of data. If you really want the data to be an Integer, then you'll need to use a Data Flow (instead of the Copy activity) to cast the incoming values to an integer column.
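If you want to verify what type actually landed in the Parquet file, one option is to inspect it with pyarrow. This is only a minimal sketch: the ID column comes from the question, but downloading the file locally, the file name, and the cast at the end are assumptions, not something from the original post.

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical local copy of the Parquet file the Copy activity wrote to ADLS.
pf = pq.ParquetFile("copied_from_oracle.parquet")

# Print the schema; a NUMBER column copied this way typically shows up as double.
print(pf.schema_arrow)

# If the values are all whole numbers they can be cast back to int64 after the
# fact, but fixing the type upstream (e.g. a Data Flow cast) is the cleaner option.
table = pf.read()
print(table.column("ID").cast(pa.int64()))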

Related

Error when running NiFi ExecuteSQL processor. Datetime column can not be converted to Avro format

Hi everyone. I'm learning about some NiFi processors.
I want to obtain all the data of several tables automatically.
So I used a ListDatabaseTables processor with the aim of getting the names of the tables that are in a specific catalog.
After that, I used other processors to generate the queries, like GenerateTableFetch and ReplaceText. Everything works perfectly up to here.
Finally, the ExecuteSQL processor comes into play, and here an error is displayed. It says that a datetime column cannot be converted to Avro format.
The problem is that there are several tables, so specifying and casting those columns individually would be complicated.
Is there a possible solution to fix the error?
The connection is with Microsoft SQL Server.
Here is the image of my flow:

ADF force format stored in parquet from copy activity

I've created an ADF pipeline that converts a delimited file to parquet in our data lake. I've added an additional column and set its value using the following expression: #convertfromutc(utcnow(),'GMT Standard Time','o'). The problem I am having is that when I look at the parquet file, the date is coming back in the US format,
e.g. 11/25/2021 14:25:49
Even if I use #if(pipeline().parameters.LoadDate,json(concat('[{"name": "LoadDate" , "value": "',formatDateTime(convertfromutc(utcnow(),'GMT Standard Time','o')),'"}]')),NULL) to try to force the format on the extra column it still comes back in the parquet in the US format.
Any idea why this would be and how I can get this to output into parquet as a proper timestamp?
Mention the format pattern while using the convertFromUtc function, as shown below.
#convertFromUtc(utcnow(),'GMT Standard Time','yyyy-MM-dd HH:mm:ss')
Added a date1 column under Additional columns in the Source to get the required date format.
Preview of source data in mappings. Here the data is previewed in the format given in the convertFromUtc function.
Output parquet file:
Data preview of the sink parquet file after copying data from the source.
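If you want to double-check the file outside ADF, reading it back with pandas shows whether the extra column landed as the formatted string. A minimal sketch, assuming a local copy of the sink file and the date1 column name used above; the file name is a placeholder.

import pandas as pd

# Hypothetical local copy of the sink Parquet file written by the Copy activity.
df = pd.read_parquet("sink_output.parquet")

# date1 should now contain values like '2021-11-25 14:25:49'
# rather than the US-formatted date.
print(df["date1"].head())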

avro data validation with the schema

I'm new to data validation and the related concepts, so please excuse me if this is a simple question; help me with the steps to achieve this.
Use case: Validating AVRO file (Structure and Data)
Inputs:
We are going to receive AVRO files
We will have a schema file in a notepad (e.g. field name, data type, size, etc.)
Validation:
Need to validate the AVRO file against the structure (schema, i.e. field, data type, size, etc.)
Need to validate the number and decimal formats when viewing from Hive
So far, all I have been able to achieve is to get the schema from the avro file using the avro jar.
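For reference, below is a minimal sketch of one possible way to do both checks in Python with the fastavro library; the file names, and the idea of comparing against a separately supplied .avsc schema, are assumptions rather than anything from the original post, and it assumes the top-level Avro schema is a record.

import json
from fastavro import reader
from fastavro.validation import validate

# Read the schema embedded in the Avro file itself, plus the records.
with open("input.avro", "rb") as f:
    avro_reader = reader(f)
    embedded_schema = avro_reader.writer_schema
    records = list(avro_reader)

# Load the expected schema (the field name / data type / size description).
with open("expected_schema.avsc") as f:
    expected_schema = json.load(f)

# Structural check: compare field names and types.
embedded_fields = {fld["name"]: fld["type"] for fld in embedded_schema["fields"]}
expected_fields = {fld["name"]: fld["type"] for fld in expected_schema["fields"]}
print("Schema matches:", embedded_fields == expected_fields)

# Data check: validate every record against the expected schema (raises on mismatch).
for record in records:
    validate(record, expected_schema)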

apache drill memory exception

I am trying to reformat over 600 GB of CSV files into parquet using Apache Drill in a single-node setup.
I run my sql statement:
CREATE TABLE Data_Transform.`/` AS
....
FROM Data_source.`/data_dump/*`
and it is creating parquet files but I get the error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR:
One or more nodes ran out of memory while executing the query.
is there a way around this?
Or is there an alternative way to do the conversion?
I don't know if querying all those GB on a local node is feasible. If you've configured the memory per the docs, using a cluster of Drillbits to share the load is the obvious solution, but I guess you already know that.
If you're willing to experiment, and you're converting the csv files using a select * rather than selecting individual columns, change the query to something like select columns[0] as user_id, columns[1] as user_name. Cast any columns to types like int, float or datetime if possible. This avoids the overhead of reading and storing the data as varchars, and prepares the data for future queries that would otherwise need casts for any analysis.
I've also seen the following recommendation from a Drill developer: split the files into smaller files manually to overcome the limitations of the local file system; Drill doesn't split files on block boundaries.
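A minimal sketch of that manual split, assuming plain CSV files with a single header row; the chunk size and file names below are placeholders, not something from the original posts.

import csv
from pathlib import Path

CHUNK_ROWS = 1_000_000  # placeholder; tune to what a single Drillbit handles comfortably

def split_csv(path: Path, out_dir: Path) -> None:
    # Split one large CSV into smaller parts, repeating the header in each part.
    out_dir.mkdir(parents=True, exist_ok=True)
    with path.open(newline="") as src:
        rows_reader = csv.reader(src)
        header = next(rows_reader)
        part, buffered = 0, []
        for row in rows_reader:
            buffered.append(row)
            if len(buffered) >= CHUNK_ROWS:
                write_part(out_dir, path.stem, part, header, buffered)
                part, buffered = part + 1, []
        if buffered:
            write_part(out_dir, path.stem, part, header, buffered)

def write_part(out_dir: Path, stem: str, part: int, header, rows) -> None:
    # Write one chunk with the original header so Drill can still infer columns.
    with (out_dir / f"{stem}_part{part:04d}.csv").open("w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)

split_csv(Path("data_dump/big_file.csv"), Path("data_dump_split"))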

AS400 to Oracle 10g via xml with Informatica Powercenter

Is the following workflow possible with Informatica Powercenter?
AS400 -> Xml(in memory) -> Oracle 10g stored procedure (pass xml as param)
Specifically, I need to take a result set, e.g. 100 rows, convert those rows into a single XML document as a string in memory, and then pass that as a parameter to an Oracle stored procedure that is called only once. My understanding was that a workflow runs row by row and that this kind of 'batching' is not possible.
Yes, this scenario should be possible.
You can connect to AS/400 sources with native Informatica connector(s), although this might require (expensive) licenses. Another option is to extract the data from AS/400 source into a text file, and use that as a normal file source.
To convert multiple rows into one row, you would use an Aggregator transformation. You may need to create a dummy column (with same value for all rows) using an Expression, and use that column as the grouping key of the Aggregator, to squeeze the input into one single row. Row values would be concatenated together (separated by some special character) and then you would use another Expression to split and parse the data into as many ports (fields) as you need.
Next, with an XML Generator transformation you can create the XML. This transformation can have multiple input ports (fields) and its result will be directed into a single output port.
Finally, you would load the generated XML value into your Oracle target, possibly using a Stored Procedure transformation.
