Can I combine data present in multiple files stored in different folders using powerquery? - powerquery

I want to combine data which is present in different files which are in turn stored in different folders. I've tried all I can but it seems I can only combine data from files present in a single folder. Is there any other way through which I can accomplish this?

You can create one query per folder and then append them. Folder.Files returns a table, so the two results can be combined with the & operator:
= Folder.Files("C:\folder1") & Folder.Files("C:\folder2")

Related

How to fetch list of files under one folder in adls gen 2

My requirement is this: every day I receive files of different types, such as Excel, CSV, Avro, and JSON.
I need to fetch the list of file names, for example:
tablea.xls
tablea.csv, etc.
I need to convert all the files from their different formats to CSV.
This needs to be done using ADF.
Thanks,
Use the Get Metadata activity to list the files and the Copy activity to convert the format. Copy can change formats but cannot do much in the way of transformation. Specify the format you want in the Sink section of the Copy configuration. Try things out, work through some tutorials, and come back if you get specific errors.

How do I add multiple csv files to the catalog in kedro

I have 4 CSV files in Azure Blob Storage, all with the same metadata, that I want to process. How can I add them to the data catalog under a single name in Kedro?
I checked this question:
https://stackoverflow.com/questions/61645397/how-do-i-add-many-csv-files-to-the-catalog-in-kedro
But this seems to load all the files in the given folder.
My requirement is to read only the given 4 files out of the many files in the Azure container.
Example:
I have many files in an Azure container, among which are 4 transaction CSV files with names sales_<date_from>_<date_to>.csv. I want to load these 4 transaction CSV files into the Kedro data catalog under one dataset.
For starters, PartitionedDataSet is lazy: loading it returns a dictionary that maps each partition ID to a load function, and a file is not actually read until you call its function. Even if you have 100 CSV files that get picked up by the PartitionedDataSet, you can select the partitions that you actually load and work with.
Second, what distinguishes these 4 files from the others? If they have a unique suffix, you can use the filename_suffix option to select just them. For example, if you have:
file_i_dont_care_about.csv
first_file_i_care_about.csv
second_file_i_care_about.csv
third_file_i_care_about.csv
fourth_file_i_care_about.csv
you can specify filename_suffix: _file_i_care_about.csv.
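The lazy selection described above can be sketched in plain Python. This is a minimal illustration, not Kedro code: the `partitions` dictionary stands in for what a PartitionedDataSet passes to a node (partition ID mapped to a load function), and `load_matching`, `make_loader`, and the file names are hypothetical.

```python
import re
from typing import Callable, Dict, List

Row = Dict[str, str]


def load_matching(
    partitions: Dict[str, Callable[[], List[Row]]], pattern: str
) -> List[Row]:
    """Call only the load functions whose partition ID matches `pattern`."""
    wanted = re.compile(pattern)
    rows: List[Row] = []
    for partition_id in sorted(partitions):
        if wanted.fullmatch(partition_id):
            # The underlying file would be read only at this call.
            rows.extend(partitions[partition_id]())
    return rows


# Hypothetical stand-in for the lazy loaders; records which ones were called.
loaded: List[str] = []


def make_loader(name: str) -> Callable[[], List[Row]]:
    def loader() -> List[Row]:
        loaded.append(name)
        return [{"source": name}]
    return loader


partitions = {
    name: make_loader(name)
    for name in [
        "sales_2020-01-01_2020-01-31",
        "sales_2020-02-01_2020-02-29",
        "inventory_2020-01",
    ]
}

sales = load_matching(partitions, r"sales_.+_.+")
```

Only the two `sales_` loaders are invoked here; the `inventory_2020-01` partition is never read. In a real node the load functions would typically return DataFrames, which you would concatenate instead of extending a list.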
I don't think there's a direct way to do this. You can add a subdirectory inside the blob storage containing just the 4 files and then use:
my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw/subdirectory/"
  dataset: "pandas.CSVDataSet"
Or, if the requirement of using only 4 files is not going to change anytime soon, you might as well list the 4 files separately in catalog.yml to avoid over-engineering it.

Combine multiple VCF files into one large VCF file

I have a list of VCF files from specific ethnicities such as American Indian, Chinese, European, etc.
Under each ethnicity, I have 100+ files.
Currently, I have computed the variant QC metrics such as call_rate, n_het, etc. for one file, as shown in the Hail tutorial.
However, now I would like to have one file for each ethnicity and then compute VARIANT_QC metrics.
I already referred to this post and this post but don't think this addresses my query
How can I do this across all files under a specific ethnicity?
Can anyone help me with this?
Is there a way to do this with Hail, Python, R, or other tools?
You could use Variant Transforms to achieve this goal. Variant Transforms is a tool for parsing and importing VCF files into BigQuery. It can also perform the reverse transform: exporting variants stored in BigQuery tables to a VCF file. So basically you need: multiple VCF files -> BigQuery -> single VCF file.
Variant Transforms can easily handle multiple input files. It can also apply more complex logic to merge the same variants across multiple files into the same record. Once your variants are all loaded into BigQuery, you can export them to a VCF file.
Note that Variant Transforms creates a separate table for each chromosome to optimize query costs. You can easily create a VCF file for each chromosome and then merge them together to create a single one.
You can reach out to the Variant Transforms team if you need help with this task.

How to merge HDFS small files into a one large file?

I have a number of small files generated from a Kafka stream, and I would like to merge them into one single file. The merge should be based on date: the original folder may contain many older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
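Use something like the code below to list the smaller files and write them out as one big file (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file). All files are read in a single pass, because appending them one at a time would leave one part file per input in target:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Collect the paths of all small files under source.
val paths = fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
spark.read.text(paths: _*).coalesce(1).write.mode(SaveMode.Append).text(target)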
This example assumes text file format, but you can just as well read any Spark-supported format, and you can use different formats for source and target.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
If you're working with HDFS and S3, this may also be helpful. You might even be able to use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a specific option to aggregate multiple files in HDFS: --groupBy, which groups files based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern.
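The grouping idea can be sketched in Python as follows. This only illustrates how a regular expression with a capture group can bucket file names by date; it is not how s3-dist-cp is implemented, and the `group_by_capture` helper and the `events-<date>-<n>.txt` naming scheme are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List


def group_by_capture(names: List[str], pattern: str) -> Dict[str, List[str]]:
    """Bucket file names by the text captured by the groups in `pattern`.

    Names that do not match the pattern are skipped.
    """
    regex = re.compile(pattern)
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        match = regex.fullmatch(name)
        if match:
            key = "".join(match.groups())
            groups[key].append(name)
    return dict(groups)


# Hypothetical file names with the date embedded in the name.
files = [
    "events-2021-03-01-0001.txt",
    "events-2021-03-01-0002.txt",
    "events-2021-03-02-0001.txt",
    "README",
]

groups = group_by_capture(files, r"events-(\d{4}-\d{2}-\d{2})-\d+\.txt")
```

Each key in `groups` is a date, and its list holds the small files that would be merged into one output for that date. To merge a single given date, pick just that key.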
You can also develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame to a big file in append mode.

Comparing Data in Files present in two different Directories

I want to compare the data in files with the same names in two different directories, i.e. I have 10 files in the LND folder and the same 10 files in the WIP folder. How can I compare the data of the files in these directories? Kindly help. Thanks.
If your files are text, what about using the diff utility?
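If a scripted comparison is preferred, one option is Python's standard `filecmp` module. The sketch below is an alternative to diff, not something from the original answer; the `compare_dirs` helper and the sample file contents are made up for illustration, with temporary LND and WIP directories standing in for the real ones.

```python
import filecmp
import os
import tempfile
from typing import List, Tuple


def compare_dirs(dir_a: str, dir_b: str) -> Tuple[List[str], List[str], List[str]]:
    """Compare same-named files in two directories by content.

    Returns (identical, different, could_not_compare) as sorted name lists.
    """
    common = sorted(set(os.listdir(dir_a)) & set(os.listdir(dir_b)))
    # shallow=False forces a byte-by-byte comparison instead of trusting os.stat().
    match, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)
    return sorted(match), sorted(mismatch), sorted(errors)


def write(path: str, text: str) -> None:
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Demo with temporary stand-ins for the LND and WIP folders.
with tempfile.TemporaryDirectory() as root:
    lnd = os.path.join(root, "LND")
    wip = os.path.join(root, "WIP")
    os.mkdir(lnd)
    os.mkdir(wip)
    write(os.path.join(lnd, "a.txt"), "same\n")
    write(os.path.join(wip, "a.txt"), "same\n")
    write(os.path.join(lnd, "b.txt"), "old value\n")
    write(os.path.join(wip, "b.txt"), "new value\n")
    result = compare_dirs(lnd, wip)
```

For the files reported as different, `difflib.unified_diff` from the standard library can then show the line-level changes.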
