I have 4 CSV files in Azure Blob Storage, with the same metadata, that I want to process. How can I add them to the DataCatalog under a single name in Kedro?
I checked this question
https://stackoverflow.com/questions/61645397/how-do-i-add-many-csv-files-to-the-catalog-in-kedro
But that approach seems to load all the files in the given folder, whereas my requirement is to read only the given 4 out of the many files in the Azure container.
Example:
My Azure container holds many files, among which are 4 transaction CSV files named sales_<date_from>_<date_to>.csv. I want to load these 4 transaction CSV files into the Kedro DataCatalog under one dataset.
For starters, PartitionedDataSet is lazy: its load() returns a dictionary mapping each partition id to a load function, and a file is not actually read until you call that function. Even if you have 100 CSV files that get picked up by the PartitionedDataSet, you can select only the partitions that you actually load and work with.
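As a rough illustration (the node name, partition ids, and date ranges below are made up, and this assumes a Kedro version where the class is called PartitionedDataSet), a node that receives the partitioned dataset gets a dictionary of partition ids mapped to load functions, and it only reads the partitions it asks for:

import pandas as pd

# Sketch of a node that takes a PartitionedDataSet as input.
# `partitions` maps each partition id (roughly, the file name relative to the
# configured path, with any configured filename_suffix stripped) to a
# zero-argument load function.
def combine_selected_sales(partitions: dict) -> pd.DataFrame:
    wanted = [  # hypothetical partition ids for the 4 files of interest
        "sales_2021-01-01_2021-01-31.csv",
        "sales_2021-02-01_2021-02-28.csv",
        "sales_2021-03-01_2021-03-31.csv",
        "sales_2021-04-01_2021-04-30.csv",
    ]
    frames = [partitions[name]() for name in wanted]  # files are read only here
    return pd.concat(frames, ignore_index=True)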
Second, what distinguishes these 4 files from the others? If they have a unique suffix, you can use the filename_suffix option to just select them. For example, if you have:
file_i_dont_care_about.csv
first_file_i_care_about.csv
second_file_i_care_about.csv
third_file_i_care_about.csv
fourth_file_i_care_about.csv
you can specify filename_suffix: _file_i_care_about.csv.
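For what it's worth, here is a sketch of that idea through the Python API rather than catalog.yml (the container path is invented, credentials are omitted, and the exact import path and class name depend on your Kedro version):

from kedro.io import PartitionedDataSet

# Sketch: only files whose names end with the suffix become partitions,
# and the suffix is stripped from the partition ids.
sales_files = PartitionedDataSet(
    path="abfs://my-container/data/01_raw/",   # hypothetical Azure Blob path
    dataset="pandas.CSVDataSet",
    filename_suffix="_file_i_care_about.csv",
    # Azure credentials would normally be passed via the credentials argument
    # or referenced from credentials.yml (omitted in this sketch).
)

partitions = sales_files.load()  # {partition_id: load_function, ...}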
I don't think there's a direct way to do this, but you can add another subdirectory inside the blob storage with the 4 files and then use:
my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw/subdirectory/"
  dataset: "pandas.CSVDataSet"
Or, if the requirement of using only 4 files is not going to change anytime soon, you might as well define the 4 files as separate entries in catalog.yml to avoid over-engineering it.
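Purely as an illustration of that last option (the dataset names, file names, and container below are invented, and the CSVDataSet import path differs between Kedro versions), the "four separate entries" idea looks roughly like this when written with the Python DataCatalog API; the catalog.yml form would be four analogous entries:

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet  # import path varies by Kedro version

# Sketch: one explicit dataset per file, each registered under its own name.
catalog = DataCatalog(
    {
        "sales_q1": CSVDataSet(filepath="abfs://my-container/sales_2021-01-01_2021-03-31.csv"),
        "sales_q2": CSVDataSet(filepath="abfs://my-container/sales_2021-04-01_2021-06-30.csv"),
        "sales_q3": CSVDataSet(filepath="abfs://my-container/sales_2021-07-01_2021-09-30.csv"),
        "sales_q4": CSVDataSet(filepath="abfs://my-container/sales_2021-10-01_2021-12-31.csv"),
    }
)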
I want to combine data that is present in different files, which are in turn stored in different folders. I've tried everything I can, but it seems I can only combine data from files present in a single folder. Is there any other way I can accomplish this?
You can create two queries (one for each folder) and then combine them:
= Folder.Files("C:\folder1") & Folder.Files("C:\folder2")
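If you are scripting this in Python/pandas rather than Power Query, the same idea (read each folder, then concatenate) could look like the following sketch, reusing the folder paths from the answer above:

from pathlib import Path
import pandas as pd

# Sketch: read every CSV from both folders and stack them into one DataFrame.
folders = [Path(r"C:\folder1"), Path(r"C:\folder2")]
frames = [pd.read_csv(f) for folder in folders for f in folder.glob("*.csv")]
combined = pd.concat(frames, ignore_index=True)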
I need to modify one file on 20 different nodes. The problem is that the files differ in length, number of rows, and values across those nodes.
So I need to replace those files with a newly created one, but some unique data from the old files should be preserved as well.
How should I implement this on the nodes, overwriting the existing file with the new one while keeping some unique data? In this case it will be a different host name per node.
I need to do this using Chef and Ruby.
Any ideas?
Thanks.
I want to compare the data in files with the same names that are present in two different directories, i.e. I have 10 files in an LND folder and the same 10 files in a WIP folder. How can I compare the data of the files across these two directories? Kindly help. Thanks.
If your files are textual, what about using diff?
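If you would rather script the comparison than run diff by hand, a small sketch using Python's standard filecmp module (folder names taken from the question) could look like this:

import filecmp
from pathlib import Path

lnd, wip = Path("LND"), Path("WIP")

# Sketch: for each file in LND, check whether the same-named file in WIP
# has identical contents.
for lnd_file in sorted(lnd.iterdir()):
    wip_file = wip / lnd_file.name
    if not wip_file.exists():
        print(f"{lnd_file.name}: missing in WIP")
    elif filecmp.cmp(lnd_file, wip_file, shallow=False):
        print(f"{lnd_file.name}: identical")
    else:
        print(f"{lnd_file.name}: contents differ")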
I am grepping a few extremely large CSV files (around 24 million rows each) using two mutually exclusive regexes to filter rows. I cannot share the regexes or the files (not that you would ever want to download them).
The idea is that rows that match regex A get piped into file A. Rows that match regex B get piped into file B.
What I end up with is about 5 million extra rows in the target files after this process completes.
The regexes are guaranteed to be mutually exclusive, and the line counts are correct.
The task is running on an Amazon EC2 node. Has anyone ever seen this kind of issue when running grep in the cloud?
Using awk instead seems to fix the problem.
Thanks Barmar!
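For reference, the kind of single-pass, mutually exclusive routing that an awk script does can be sketched in Python like this (the patterns and file names are placeholders, since the real ones can't be shared):

import re

regex_a = re.compile(r"PATTERN_A")  # placeholder for the first regex
regex_b = re.compile(r"PATTERN_B")  # placeholder for the second regex

# Sketch: route each row to exactly one output file in a single pass.
with open("input.csv") as src, \
     open("file_a.csv", "w") as out_a, \
     open("file_b.csv", "w") as out_b:
    for line in src:
        if regex_a.search(line):
            out_a.write(line)
        elif regex_b.search(line):
            out_b.write(line)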