kedro DataSetError while loading PartitionedDataSet

I am using PartitionedDataSet to load multiple CSV files from Azure Blob Storage. I defined my dataset in the data catalog as below.
my_partitioned_data_set:
  type: PartitionedDataSet
  path: my/azure/folder/path
  credentials: my credentials
  dataset: pandas.CSVDataSet
  load_args:
    sep: ";"
    encoding: latin1
I also defined a node to combine all the partitions. But while loading each file as a CSVDataSet, kedro does not apply the load_args, so I get the error below.
Failed while loading data from data set CSVDataSet(filepath=my/azure/folder/path, load_args={}, protocol=abfs, save_args={'index': False}).
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The error shows that while loading each CSVDataSet, kedro is not picking up the load_args defined on the PartitionedDataSet and is passing an empty dict as the load_args parameter instead.
I am following the documentation
https://kedro.readthedocs.io/en/stable/05_data/02_kedro_io.html#partitioned-dataset
I cannot figure out what I am doing wrong.

Move load_args inside dataset
my_partitioned_data_set:
  type: PartitionedDataSet
  path: my/azure/folder/path
  credentials: my credentials
  dataset:
    type: pandas.CSVDataSet
    load_args:
      sep: ";"
      encoding: latin1
load_args specified outside dataset is passed to the find() method of the corresponding filesystem implementation, i.e. it controls how partitions are discovered, not how each one is loaded.
To pass granular configuration to the underlying dataset, put it inside dataset as shown above.
You can check out the details in the docs
https://kedro.readthedocs.io/en/stable/05_data/02_kedro_io.html?highlight=partitoned%20dataset#partitioned-dataset-definition
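For completeness, here is a minimal sketch of the combining node, assuming the partitioned dataset is passed to the node as a dictionary mapping partition ids to load functions (which is how PartitionedDataSet exposes its partitions); the function name combine_partitions is illustrative:

from typing import Any, Callable, Dict

import pandas as pd

def combine_partitions(partitions: Dict[str, Callable[[], Any]]) -> pd.DataFrame:
    # Each value is a load function; calling it reads one CSV using the
    # load_args configured on the underlying pandas.CSVDataSet.
    frames = [load() for load in partitions.values()]
    return pd.concat(frames, ignore_index=True)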

Related

Trino unexpected partition name error =__HIVE_DEFAULT_PARTITION__ != []

The table is in Parquet format and sits in MinIO storage. I create the Parquet file. After the Parquet file is inserted into MinIO, I run the following command to add the new partition:
call system.sync_partition_metadata('myschema', 'mytable', 'ADD', true)
However, it errors out
[Code: 65551, SQL State: ] Query failed (#20221210_041833_00033_4iqfp): unexpected partition name: mypartitionfield=__HIVE_DEFAULT_PARTITION__ != []
Earlier on I did insert an empty partition value. How can I remedy this issue? The mypartitionfield column is of type date.
I tried dropping and recreating the table, as well as deleting the MinIO folder structure and creating it again. Neither worked.

Azure Data Factory - How to filter out specific files in multiple .zip files?

I have set up an ADF pipeline that gets a set of .zip files from Azure Storage and iterates through each zip file's folders and files to land them in an output container with the hierarchy preserved.
Get Metadata:
For Each:
Issue:
The issue is that a specific .PDF file (ASC_NTS.pdf) with the same name is embedded within each .zip file:
It is causing this error when trying to run the pipeline:
Error
Operation on target ForEach1 failed: Activity failed because an inner activity failed; Inner activity name: Copy data1, Error: ErrorCode=AdlsGen2OperationFailedConcurrentWrite,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error occurred when trying to upload a file. It's possible because you have multiple concurrent copy activities runs writing to the same file 'FAERS_output/ascii/ASC_NTS.pdf'. Check your ADF configuration.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'PreconditionFailed'. Account: 'asastgssuaefdbhdg2dbc4'. FileSystem: 'curated'. Path: 'FAERS_output/ascii/ASC_NTS.pdf'. ErrorCode: 'LeaseIdMissing'. Message: 'There is currently a lease on the resource and no lease ID was specified in the request.'. RequestId: 'b21022a6-b01f-0031-641a-453ab6000000'. TimeStamp: 'Thu, 31 Mar 2022 16:15:56 GMT'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'PreconditionFailed',Source=Microsoft.DataTransfer.ClientLibrary,'
Is there a workaround for this pipeline setup that allows me to filter within the For Each loop? I just need the .TXT files; the .PDF files can be discarded.
This was the closest reference I could find, but does not address my use case:
Filter out file using wildcard path azure data factory
Have you tried using an If Condition activity? You can set the expression to check for the correct file extension.
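For example, assuming the For Each iterates over the child items returned by Get Metadata (so each item exposes a name property), the If Condition expression could be something like:

@endsWith(toLower(item().name), '.txt')

with the Copy activity placed inside the True branch, so only the .TXT files are copied.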

What path should be used to locate a CSV file used in a SQL statement (LOAD DATA LOCAL INFILE) when the WAR file is deployed on Tomcat?

I have been working on a Spring Boot project and am using Flyway for database version control. In the migration folder there are some SQL files containing LOAD DATA LOCAL INFILE statements that reference CSV files.
Example:
load data local infile 'C:/Program Files (x86)/Apache Software Foundation/Tomcat 8.5/webapps/originator/WEB-INF/classes/insertData/subject.csv' INTO TABLE subject
How can I make this path relative?
I have tried
'./classes/insertData/subject.csv'
'./insertData/subject.csv'
and some other combinations as well, but could not fix the issue.
Error:
Caused by: java.sql.SQLException: Unable to open file '../../insertData/subject.csv'for 'LOAD DATA LOCAL INFILE' command.Due to underlying IOException:
BEGIN NESTED EXCEPTION java.io.FileNotFoundException MESSAGE:
....\insertData\subject.csv (The system cannot find the path
specified) STACKTRACE: java.io.FileNotFoundException:
....\insertData\subject.csv (The system cannot find the path
specified)
I was able to insert data into tables from CSV files within a Flyway migration, using a path under the project's resources. In the migration script I used the full path, written as shown below.
LOAD DATA LOCAL INFILE './src/main/resources/<FOLDER>/<FILE>.csv' INTO TABLE <TABLE_NAME>
FIELDS TERMINATED BY ','
optionally enclosed by '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
The other clauses depend on your file structure; I just wanted to include the entire example.
Instead of writing a SQL script, you can use a Java-based migration to read the CSV and insert the data into the table. You can use the "flyway.locations" property in your application.properties to specify the path for Java-based migrations; by default Flyway searches db/migration under resources.
For further details, see https://flywaydb.org/documentation/migrations#java-based-migrations

Required field 'uncompressed_page_size' was not found in serialized data! Parquet

I am getting the below error while trying to save a Parquet file from a local directory using PySpark.
I tried Spark 1.6 and 2.2; both give the same error.
It displays the schema properly but errors out at the time of writing the file.
base_path = "file:/Users/xyz/Documents/Temp/parquet"
reg_path = "file:/Users/xyz/Documents/Temp/parquet/ds_id=48"
df = sqlContext.read.option("basePath", base_path).parquet(reg_path)
out_path = "file:/Users/xyz/Documents/Temp/parquet/out"
df2 = df.coalesce(5)
df2.printSchema()
df2.write.mode('append').parquet(out_path)
org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
In my own case, I was writing a custom Parquet parser for Apache Tika and I experienced this error. It turned out that if the file is being used by another process, the ParquetReader will not be able to access uncompressed_page_size, hence the error.
Verify that no other process is holding on to the file.
Temporarily resolved it with the Spark config:
"spark.sql.hive.convertMetastoreParquet": "false"
Although it comes with extra cost, it is a workaround for now.
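As a sketch of where that flag can go (assuming Spark 2.x, where a SparkSession is available; the app name is illustrative):

from pyspark.sql import SparkSession

# convertMetastoreParquet=false makes Spark fall back to the Hive SerDe for
# Hive metastore Parquet tables instead of its built-in Parquet support,
# which is slower, hence the extra cost mentioned above.
spark = SparkSession.builder \
    .appName("parquet-workaround") \
    .config("spark.sql.hive.convertMetastoreParquet", "false") \
    .getOrCreate()

# It can also be toggled on an existing session:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")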

traefik - basic auth for entry point via key/value store

I would like to configure basic auth for one of my entry points via a key/value store (Consul in my case), but traefik seems to ignore the directives.
I tried the following configurations:
traefik/entrypoints/http/auth/basic/users = ["test:$apr1$H6uskkkW$IgXLP6ewTrSuBkTrqE8wj/"]
traefik/entrypoints/http/auth/basic/users = test:$apr1$H6uskkkW$IgXLP6ewTrSuBkTrqE8wj/
traefik/entrypoints/http/auth/basic/users/0 = test:$apr1$H6uskkkW$IgXLP6ewTrSuBkTrqE8wj/
I get the following error
-------------------------------------
/var/log/containers/traefik-c9f95e2d3a98-stdouterr.log
-------------------------------------
2017/06/12 15:58:34 Error loading configuration: 1 error(s) decoding:
* error decoding 'EntryPoints[http].Auth.Basic.Users': illegal base64 data at input byte 5
The toml file seems to be ignored if I specify a key/value store...
What am I doing wrong?
I figured out what was wrong.
If you provide a key/value store such as Consul, it will override the configuration in your config file.
The correct key (path) to store basic auth users looks like this:
traefik/entrypoints/http/auth/basic/users/0
and the value is the username and the password hash separated by a colon:
test:$apr1$H6uskkkW$IgXLP6ewTrSuBkTrqE8wj/
If you get the above error message about base64 encoding, you have to escape each $ with an additional $, so the hash looks like $$apr1$$H6uskkkW$$IgXLP6ewTrSuBkTrqE8wj/
