How do I add multiple csv files to the catalog in kedro - azure-blob-storage

I have 4 CSV files in Azure Blob Storage, with the same metadata, that I want to process. How can I add them to the data catalog under a single name in Kedro?
I checked this question:
https://stackoverflow.com/questions/61645397/how-do-i-add-many-csv-files-to-the-catalog-in-kedro
But that approach seems to load all the files in the given folder.
My requirement is to read only 4 specific files out of the many files in the Azure container.
Example:
I have many files in the Azure container, among them 4 transaction CSV files named sales_<date_from>_<date_to>.csv. I want to load these 4 transaction CSV files into the Kedro data catalog as one dataset.

For starters, PartitionedDataSet is lazy: loading it gives you a dictionary that maps each partition ID to a load function, and a file is not actually read until you explicitly call its function. Even if 100 CSV files get picked up by the PartitionedDataSet, you can select the partitions that you actually load and work with.
Second, what distinguishes these 4 files from the others? If they share a unique suffix, you can use the filename_suffix option to select just them. For example, if you have:
file_i_dont_care_about.csv
first_file_i_care_about.csv
second_file_i_care_about.csv
third_file_i_care_about.csv
fourth_file_i_care_about.csv
you can specify filename_suffix: _file_i_care_about.csv.
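To illustrate the lazy-loading point, here is a minimal sketch of a node function that receives the kind of object a PartitionedDataSet passes to a node: a dictionary mapping partition IDs to load callables. The function name `load_selected_partitions`, the `sales_*` pattern, the partition IDs, and the stand-in loaders are all assumptions for illustration; in a real pipeline the callables would return DataFrames.

```python
from fnmatch import fnmatch
from typing import Any, Callable, Dict, List


def load_selected_partitions(
    partitions: Dict[str, Callable[[], Any]], pattern: str
) -> Dict[str, Any]:
    """Call the load function only for partition IDs that match `pattern`."""
    return {
        partition_id: load()
        for partition_id, load in sorted(partitions.items())
        if fnmatch(partition_id, pattern)
    }


# Stand-in for what a partitioned dataset hands to a node (hypothetical IDs).
calls: List[str] = []


def make_loader(name: str) -> Callable[[], str]:
    def _load() -> str:
        calls.append(name)  # record which partitions were really loaded
        return f"contents of {name}"

    return _load


partitions = {
    name: make_loader(name)
    for name in ["sales_2020_2021", "sales_2021_2022", "inventory_2021"]
}
selected = load_selected_partitions(partitions, "sales_*")
print(sorted(selected))  # only the sales partitions were loaded
```

Partitions that do not match the pattern are never read, which is why picking up extra files in the folder costs little.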

I don't think there's a direct way to do this. You can add a subdirectory inside the blob storage that contains just the 4 files and then use:
my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw/subdirectory/"
  dataset: "pandas.CSVDataSet"
Or, if the requirement of using only these 4 files is not going to change anytime soon, you might as well list the 4 files separately in catalog.yml to avoid over-engineering it.

Related

Transfer CSV files from azure blob storage to azure SQL database using azure data factory

I need to transfer around 20 CSV files from a folder named ActivityPointer in an Azure Blob Storage container to an Azure SQL database in a single Data Factory pipeline. ActivityPointer contains the 20 CSV files and also another folder named snapshots. When I create a pipeline and use * to select all the CSV files inside ActivityPointer, it includes the snapshots folder too, which should not be included. Is there any way to complete this task? I also can't move the snapshots folder into another folder. What can I do?
Assuming you want to copy all CSV files within the ActivityPointer folder, you can use a wildcard expression: provide the path up to the ActivityPointer folder and then *.csv as the wildcard file name.
Copy data also considers the inner folder when wildcards are used (even with *.csv in the wildcard file path), so you have to check whether each item is a file or a folder.
First, use a Get Metadata activity on the required folder with Child items in the field list. Its output lists each child item with its name and type.
Then iterate through the child items with a ForEach activity, using the following as its items:
@activity('Get Metadata1').output.childItems
Inside the ForEach, use an If Condition activity to check whether the current item is a file, with the following condition:
@equals(item().type,'File')
When this is true, use a copy data activity to copy the file to the target table (ignore the false case). I created a file_name parameter in my source dataset and passed @item().name as its value.
This will help you achieve your requirement. In my debug run I had 4 files and 1 folder: the folder was ignored, and the files were copied into the target table.
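The Get Metadata, ForEach, and If Condition steps amount to filtering the child items by type. A rough Python sketch of that logic follows; it is not Data Factory code, and the `child_items` list is a made-up listing shaped like the `childItems` output (objects with `name` and `type`).

```python
from typing import Dict, List


def files_only(child_items: List[Dict[str, str]]) -> List[str]:
    """Return the names of items whose type is 'File', skipping folders."""
    return [item["name"] for item in child_items if item["type"] == "File"]


# Hypothetical listing of the ActivityPointer folder.
child_items = [
    {"name": "activity_1.csv", "type": "File"},
    {"name": "activity_2.csv", "type": "File"},
    {"name": "snapshots", "type": "Folder"},
]
print(files_only(child_items))
```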

Querying multiple files in multiple folders in Azure Storage account using Azure Databricks

I have an Azure Storage account where I store the log files coming from Azure Diagnostics. These log files are stored in multiple folders by hour and minute. For example, one of my file paths in blob storage looks like this:
resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=00/
I would like to know how to query multiple files from multiple folders at a time. For example, if I have to query data from day 23 to day 24, what's the best way to do it in Databricks? These folders contain JSON files with multiple lines of JSON. Thanks.
If you want to read all available files you can just use wildcards.
path = "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=*/m=*/d=*/h=*/m=*/*"
spark.read.option("header","true").format("csv").load(path)
If you only want to read a specific set of files, it would be best to generate a list of the paths you want to read, which you can use in the spark reading function.
pathList = [
"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=00/",
"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h=13/m=01/"
]
spark.read.option("header","true").format("csv").load(pathList)
The pathList in this example could be generated programmatically according to which files you want to process, e.g.
pathList = []
for i in range(24):
    # Zero-pad the hour, assuming hour folders are named like the minute folders (m=00)
    newPath = f"resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=05/d=23/h={i:02d}/m=01/"
    pathList.append(newPath)
spark.read.option("header","true").format("csv").load(pathList)
This example would read every hour (00-23) of 2022-05-23 at minute 01.
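For the "day 23 to day 24" case from the question, the same idea can be extended over a time range. The sketch below only builds the list of paths; the helper name `hourly_paths` is made up, the base prefix is copied from the question, and it assumes zero-padded folder names plus a `m=*` glob to match every minute folder.

```python
from datetime import datetime, timedelta
from typing import List

# Base prefix copied from the question.
BASE = (
    "resourceId=/SUBSCRIPTIONS/53TestSubscriptionIDB/RESOURCEGROUPS/AZURE-DEV/"
    "PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV"
)


def hourly_paths(start: datetime, end: datetime) -> List[str]:
    """Build one folder glob per hour in [start, end), matching every minute folder."""
    paths = []
    current = start
    while current < end:
        paths.append(
            f"{BASE}/y={current:%Y}/m={current:%m}/d={current:%d}/h={current:%H}/m=*/"
        )
        current += timedelta(hours=1)
    return paths


# Day 23 and day 24 of May 2022: 48 hourly folders.
path_list = hourly_paths(datetime(2022, 5, 23), datetime(2022, 5, 25))
print(len(path_list), path_list[0])
```

With a live Spark session, `path_list` could then be passed to the same `load(...)` call used above.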

How to merge HDFS small files into a one large file?

I have a number of small files generated from a Kafka stream, and I would like to merge them into one single file. The merge is based on the date: the original folder may contain a number of earlier files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to list the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file). Reading all the files in one job lets coalesce(1) produce a single output file:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Collect the paths of the small files, then read them together and write one part file.
val files = fs.listStatus(new Path(source)).map(_.getPath.toString)
spark.read.text(files: _*).coalesce(1).write.text(target)
This example assumes text file format, but you can just as well read any Spark-supported format, and you can use different formats for source and target as well.
You should be able to use .repartition(1) to write all results to 1 file. If you need to split by date, consider partitionBy("your_date_value").
If you're working with HDFS and S3, this may also be helpful; you might even use s3-dist-cp and stay within HDFS:
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
It has a --groupBy option that aggregates multiple files in HDFS based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern.
You can also develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame to one big file in append mode.
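The grouping idea behind a regular-expression based option such as --groupBy can be sketched in plain Python: files whose names share the same captured date end up in the same group, and each group would then be merged into one output. The file names and the helper `group_by_date` are hypothetical, since the question does not give the real naming scheme.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_date(file_names: Iterable[str], pattern: str) -> Dict[str, List[str]]:
    """Group file names by the first capture group of `pattern` (the date)."""
    regex = re.compile(pattern)
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in file_names:
        match = regex.search(name)
        if match:  # files without a date in the name are left out
            groups[match.group(1)].append(name)
    return dict(groups)


# Hypothetical file names produced by the stream.
files = [
    "events-2022-05-23-0001.txt",
    "events-2022-05-23-0002.txt",
    "events-2022-05-24-0001.txt",
    "_SUCCESS",
]
groups = group_by_date(files, r"(\d{4}-\d{2}-\d{2})")
print(groups["2022-05-23"])
```

Only the group for the requested date needs to be read and written back as a single file.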

i have a csv file with locations i need to move s3 files to new locations

I am interested in loading my data into AWS Athena.
My data is compartmentalized by source_video, and for each one we have 11 CSV files that represent 11 tables referencing this data.
Athena wants to load by table and not by source_video.
For this I have to move these files to folders based on table name and not source_video.
I am fluent in Python and Bash.
I know how to use the AWS CLI.
I would like to know if there is an easier way than running 4 million+ mv commands, executed in different processes in parallel on several machines.
I have a CSV file that holds the locations of the files, which are stored as children of the source_video they were created for:
I have 400,000+ source_video locations.
I have 11 files in each source_video location.
i.e.
+source_video1
- 11 files by type
+source_video2
- 11 files by type
+source_video3
- 11 files by type
.
.
+source_video400,000+
- 11 files by type
I wish to move them to 11 folders, one per file type, with 400,000+ files in each folder.
Fields: videoName, CClocation, identityLocation, TAGTAskslocation, M2Location, and other locations ...
Below is an example of 2 rows of data:
pj1/09/11/09/S1/S1_IBM2MP_0353_00070280_DVR1.avi,
S1_IBM2MP_0353_00070280_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2extendeddata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCfeat.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentitiestaggers.csv
pj1/09/11/09/S1/S1_IBM2MP_0443_00070380_DVR1.avi,
S1_IBM2MP_0443_00070380_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2extendeddata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCfeat.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentitiestaggers.csv
You are correct. Athena expects all files related to one table to be located in one directory, or in subdirectories of one directory.
Given that you are going to touch so many files, you could choose to process the files rather than simply move them, for example by combining the contents of many files into a smaller number of files. You could also consider compressing the files (for example with gzip), because this would cost you less to scan: Athena charges based on the amount of data scanned, and compressed files mean less data is read.
See: Analyzing Data in S3 using Amazon Athena
This type of processing could be done efficiently on an Amazon EMR cluster running Hadoop, but some specialist knowledge is required to run Hadoop, so it might be easier to use a language with which you are familiar (e.g. Python).
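As a starting point in Python, the per-table destination of each file can be derived from its name. This is a minimal sketch that only computes the new location; the helper `table_destination` and the destination prefix `s3://bucket1/athena` are assumptions, the table name is taken to be the text between `.avi_` and `.csv`, and the actual copy (with the AWS CLI or an SDK) is left out.

```python
import posixpath
from typing import Tuple


def table_destination(source: str, dest_prefix: str) -> Tuple[str, str]:
    """Return (table_name, destination) for one per-video CSV location."""
    file_name = posixpath.basename(source)
    if not file_name.endswith(".csv") or ".avi_" not in file_name:
        raise ValueError(f"unexpected file name: {file_name}")
    # The table name is whatever sits between ".avi_" and ".csv".
    table = file_name[: -len(".csv")].rsplit(".avi_", 1)[1]
    return table, f"{dest_prefix}/{table}/{file_name}"


# Source location taken from the example rows; the destination prefix is hypothetical.
source = (
    "s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/"
    "S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentities.csv"
)
table, destination = table_destination(source, "s3://bucket1/athena")
print(table, destination)
```

Running this over every location in the CSV file yields the full list of source and destination pairs, which can then be split across workers.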

What is the best place to store multiple small files in hadoop

I will have many small text files of around 10 KB each, and I am confused about whether to store them in HBase or in HDFS. Which would be the optimal storage?
To store them in HBase, I need to parse each file first and then save it against some row key.
In HDFS, I can directly create a path and save the file at that location.
But everything I have read so far says you should not have many small files and should instead create fewer, bigger files.
However, I cannot merge these files, so I can't create a big file out of the small ones.
Kindly suggest.
A large number of small files doesn't fit very well with Hadoop, since each file occupies at least one HDFS block and, by default, each block requires one Mapper to process it.
There are several options/strategies to minimize the impact of small files. All of them require processing the small files at least once to "package" them in a better format. If you are planning to read these files several times, pre-processing the small files could make sense, but if you will use them just once, it doesn't matter.
To process small files, my suggestion is to use CombineTextInputFormat (here is an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat uses one Mapper to process several files, but it may need to transfer files to the DataNode where the map is running in order to put them together. It can also perform badly with speculative tasks, but you can disable them if your cluster is stable enough.
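A back-of-the-envelope calculation shows why combining helps. The figures below (100,000 files of 10 KB and a 128 MB maximum combined split) are assumptions for illustration, and the estimate ignores node and rack locality, which can raise the real number of combined splits.

```python
import math
from typing import Tuple


def mapper_counts(
    num_files: int, file_size_bytes: int, max_split_bytes: int
) -> Tuple[int, int]:
    """Compare one mapper per file with mappers over combined splits."""
    per_file = num_files  # each small file becomes its own input split
    combined = math.ceil(num_files * file_size_bytes / max_split_bytes)
    return per_file, combined


# Assumed figures: 100,000 files of 10 KB and a 128 MB maximum combined split.
per_file, combined = mapper_counts(100_000, 10 * 1024, 128 * 1024 * 1024)
print(per_file, combined)
```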
Alternatives for repackaging small files are:
1. Create sequence files where each record contains one of the small files. With this option you will keep the original files.
2. Use IdentityMapper and IdentityReducer with fewer reducers than files. This is the easiest approach, but it requires every line in the files to be of the same kind and independent (no headers or metadata at the beginning of a file that are needed to understand the rest of it).
3. Create an external table in Hive and then insert all the records of this table into a new table (INSERT INTO . . . SELECT FROM . . .). This approach has the same limitations as option 2 and requires Hive; the advantage is that you don't need to write a MapReduce job.
If you cannot merge files as in option 2 or 3, my suggestion is to go with option 1.
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
Having many small files that are different from each other is not a problem in itself. If, for example, you have a table in Hive backed by many very small files in HDFS, that is not optimal; it is better to merge them into fewer, bigger files, because reading the table would otherwise create a lot of mappers. If your files are completely different, like 'apples' and 'employees', and cannot be merged, then just store them as is.

Resources