Can Fivetran convert multiple CSV files to Parquet?

I have the following folder hierarchy in an S3 bucket:

January/10/
    16b516c0-8f2a-eabd-770a-b8bbc83c5859.csv
    …

In other words, every folder represents a calendar day. I would like Fivetran to perform the following transform:

January/10 → ASingleParquetFile.parquet

How can I implement this in Fivetran ETL?
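For reference, the desired per-folder merge is straightforward outside of Fivetran, e.g. as a pre-processing step before loading. A minimal sketch in Python with pandas and pyarrow, assuming the daily folder is locally readable (the paths and output name below are illustrative, not a Fivetran feature):

```python
import glob

import pandas as pd  # pip install pandas pyarrow

# Collect every daily CSV under January/10 and merge into one Parquet file.
csv_paths = sorted(glob.glob("January/10/*.csv"))
merged = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

# to_parquet uses the pyarrow engine when it is installed.
merged.to_parquet("January/10/ASingleParquetFile.parquet", index=False)
```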

Related

NiFi - ListS3 listing files with exact path but different dates

I have to list and collect several specific files on a S3 bucket from different dates.
The path looks like the following:
/path/to/20221201/files/specificfile/
/path/to/20221202/files/specificfile/
The "files" and "specificfile" folders contain several different files, where I am only interest in a specific one from each.
I tried changing the date with * thinking it would list any date, but I get no result.
Any suggestions?
Thanks
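The likely cause: the S3 List API matches literal key prefixes only, so * is treated as an ordinary character rather than a wildcard, and NiFi's ListS3 prefix property has the same limitation. One common workaround is to list under the shared prefix and filter the dated paths afterwards (in NiFi, e.g. with a downstream RouteOnAttribute on the listed key). A sketch of the same idea in Python with boto3, where the bucket name and regex are assumptions:

```python
import re

import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

# S3 keys normally have no leading slash; match the date segment client-side,
# since the List API itself only supports literal prefixes.
pattern = re.compile(r"^path/to/\d{8}/files/specificfile/")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="path/to/"):
    for obj in page.get("Contents", []):
        if pattern.match(obj["Key"]):
            print(obj["Key"])
```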

Google Cloud Healthcare API: How can I get a list of all the datasets with the names of the data stores in each of them?

I am using the Google Cloud Healthcare API to store DICOM images. I need a way to fetch a list of all the datasets in a project, with the names of all the data stores in each dataset.
A possible output would look like this:
{
  dataset1: ["dataStoreA", "dataStoreB"],
  dataset2: ["dataStoreC", "dataStoreD"],
  dataset3: ["dataStoreE", "dataStoreF"],
}
I am able to fetch the list of all the datasets using

healthcare.projects.locations.datasets.list({
  parent: `projects/${project}/locations/${location}`
})

but that only returns an object for each dataset containing just the dataset's name. I want the names of all the data stores in each dataset as well.
My current approach is to fetch the list of all the datasets and then, for each dataset, fetch all the data stores in it, using

healthcare.projects.locations.datasets.dicomStores.list({
  parent: `projects/${project}/locations/${location}/datasets/${dataset}`,
})

but this is very time-consuming: if there are 5 datasets, I have to make 5 separate requests to the API. I have looked through the Google docs for the API and searched around without any luck.
The method dicomStores.list is designed to list the DICOM stores in the given dataset.
There is no method that will list all of the DICOM stores across datasets.
The current method you are using is the only option.
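For completeness, here is the same N+1 pattern sketched with the discovery-based Python client (the project and location values are placeholders, and credentials are assumed to come from Application Default Credentials). Issuing the per-dataset calls concurrently can reduce wall-clock time, but the request count stays the same:

```python
from googleapiclient import discovery  # pip install google-api-python-client

client = discovery.build("healthcare", "v1")
parent = "projects/my-project/locations/us-central1"  # placeholder values

# One datasets.list call, then one dicomStores.list call per dataset:
# the API offers no cross-dataset listing, so N+1 requests is the floor.
datasets = client.projects().locations().datasets().list(
    parent=parent).execute().get("datasets", [])

result = {}
for dataset in datasets:
    resp = client.projects().locations().datasets().dicomStores().list(
        parent=dataset["name"]).execute()
    result[dataset["name"]] = [s["name"] for s in resp.get("dicomStores", [])]

print(result)  # {dataset name: [DICOM store names], ...}
```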

Check the input file names against the file names in the config table

I have a folder which contains many files, and a configuration table in a SQL database which contains the list of file names that I need to load to Azure Blob Storage.
I tried getting the file names from the source folder using the Get Metadata activity and then used a Filter activity to filter the file names, but this way I have to hard-code the file name inside the filter.
Can someone please let me know a way to do this?
Here is an example. Suppose the folder contains a set of files, and the SQL config table lists the file names to load. This is how the sample pipeline looks (a sketch of the matching logic follows the list):
1. Look up the list of files from the SQL config table and, using a ForEach activity, append the names to an array variable; in my example it is config_files.
2. Using Get Metadata, list the childItems of the folder and append the file names into another variable; in my example it is files.
3. Use a Set Variable activity to store the result, i.e. the files that match the entries in the config table.
Expression: @intersection(variables('files'),variables('config_files'))
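The matching step, restated in plain Python to make the set logic explicit (the folder path and file names below are hypothetical stand-ins for the two pipeline variables):

```python
import os

# Stand-ins for the pipeline variables 'files' and 'config_files'.
files = set(os.listdir("/data/input"))                   # folder listing
config_files = {"sales_2023.csv", "inventory_2023.csv"}  # names from the SQL config table

# Equivalent of @intersection(variables('files'), variables('config_files')):
# keep only the folder files that the config table asks for.
to_load = sorted(files & config_files)
print(to_load)
```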

How to pass a folder of images to a dog classification CNN model for training using tensorflow?

I have a folder of 10000 images of 120 different dog breeds. Each image has a unique id, for example 000bec180eb18c7604dcecc8fe0dba07, and each id has a corresponding label in a separate CSV file. What should I do to pass these images in mini-batches to a CNN?
You can use tf.keras.preprocessing.image.DirectoryIterator, but before using it you'll need to preprocess your images so that each category has its own directory.
The preprocessing step will look something like the following (a sketch appears after the list):
1. Get the name of each file in the folder.
2. For each file, look up its category in the CSV and add it to a per-category list of files.
3. Create a sub-directory for each category.
4. Move the files into their respective directories.
5. Now use the `DirectoryIterator`.
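A sketch of those steps in Python, followed by the Keras iterator. The image folder, CSV name, and its id/breed columns are assumptions about your layout:

```python
import csv
import os
import shutil

import tensorflow as tf

IMAGES_DIR = "train"        # assumed: folder holding the 10000 .jpg files
LABELS_CSV = "labels.csv"   # assumed: columns "id" and "breed"
SORTED_DIR = "train_sorted"

# Steps 1-2: map each file id to its breed from the CSV.
with open(LABELS_CSV, newline="") as f:
    id_to_breed = {row["id"]: row["breed"] for row in csv.DictReader(f)}

# Steps 3-4: create one sub-directory per breed and move each image into it.
for image_id, breed in id_to_breed.items():
    breed_dir = os.path.join(SORTED_DIR, breed)
    os.makedirs(breed_dir, exist_ok=True)
    shutil.move(os.path.join(IMAGES_DIR, image_id + ".jpg"),
                os.path.join(breed_dir, image_id + ".jpg"))

# Step 5: let Keras stream shuffled mini-batches from the per-breed
# directories (flow_from_directory returns a DirectoryIterator).
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)
train_iter = datagen.flow_from_directory(
    SORTED_DIR, target_size=(224, 224), batch_size=32)
# model.fit(train_iter, epochs=...)
```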

Power Query From Folder as Merge, Not Append

I need to import multiple files from a folder and I need each file's contents to be new columns in the resultant table.
There are multiple examples all over the web of how to include multiple files from a folder as an append (e.g., PowerQuery multiple files and add column) but I need the contents of each file to be merged as new columns in the original table.
Any help will be greatly appreciated.
I came up with my own answer. Once you append the files, you can pivot on the file name to turn each file's contents into its own column.
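The same append-then-pivot idea, sketched in pandas for illustration (the key/value column names are assumptions; in Power Query the equivalent pivot is done on the file-name column after combining the files):

```python
import glob

import pandas as pd

# Append step: stack every file's rows, tagging each row with its source file.
frames = []
for path in glob.glob("folder/*.csv"):
    df = pd.read_csv(path)  # assumed columns: "Key" and "Value"
    df["Source"] = path
    frames.append(df)
appended = pd.concat(frames, ignore_index=True)

# Pivot step: one column per source file, keyed on the shared "Key" column.
merged = appended.pivot(index="Key", columns="Source", values="Value")
print(merged)
```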
