Comparing Data in Files present in two different Directories - shell

I want to compare the data in files with the same names that are present in two different directories. For example, I have 10 files in an LND folder and the same 10 files in a WIP folder. How can I compare the data of these files across the two directories? Kindly help. Thanks.

If your files are textual, what about using the diff command?
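For example, a minimal bash sketch, assuming the two directories are named LND and WIP as in the question:

# compare the two directory trees in one go
diff -r LND WIP

# or compare file by file, printing a header for each name
for f in LND/*; do
    name=$(basename "$f")
    echo "=== $name ==="
    diff "LND/$name" "WIP/$name"
done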

Related

How do I add multiple CSV files to the catalog in Kedro

I have 4 CSV files in Azure Blob Storage, with the same metadata, that I want to process. How can I add them to the data catalog under a single name in Kedro?
I checked this question:
https://stackoverflow.com/questions/61645397/how-do-i-add-many-csv-files-to-the-catalog-in-kedro
but that approach seems to load all the files in the given folder, whereas my requirement is to read only the given 4 out of the many files in the Azure container.
Example:
I have many files in the Azure container, among which are 4 transaction CSV files named sales_<date_from>_<date_to>.csv. I want to load these 4 transaction CSV files into the Kedro data catalog under one dataset.
For starters, PartitionedDataSet is lazy, meaning that the files are not actually loaded until you explicitly call each partition's load function. Even if you have 100 CSV files that get picked up by the PartitionedDataSet, you can select just the partitions that you actually load and work with.
Second, what distinguishes these 4 files from the others? If they have a unique suffix, you can use the filename_suffix option to select just them. For example, if you have:
file_i_dont_care_about.csv
first_file_i_care_about.csv
second_file_i_care_about.csv
third_file_i_care_about.csv
fourth_file_i_care_about.csv
you can specify filename_suffix: _file_i_care_about.csv.
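In catalog.yml that could look roughly like the following sketch (the dataset name, container path and credentials key are placeholders):

my_csv_files:
  type: PartitionedDataSet
  path: "abfs://my-container/path/to/files"
  dataset: pandas.CSVDataSet
  filename_suffix: "_file_i_care_about.csv"
  credentials: azure_blob_creds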
Don't think there's a direct way to do this, but you can add another subdirectory inside the blob storage with the 4 files and then use:
my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw/subdirectory/"
  dataset: "pandas.CSVDataSet"
Or, in case the requirement of using only 4 files is not going to change anytime soon, you might as well declare the 4 files separately in catalog.yml to avoid over-engineering it.
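For example (the dataset names and file paths below are only placeholders):

sales_q1:
  type: pandas.CSVDataSet
  filepath: "abfs://my-container/sales_2021-01-01_2021-03-31.csv"

sales_q2:
  type: pandas.CSVDataSet
  filepath: "abfs://my-container/sales_2021-04-01_2021-06-30.csv"

# ...plus two more entries for the remaining files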

NiFi: moving files on an HDFS path for the previous day's merged JSON files

I have a requirement to move the previous day's processed and merged JSON files into a new HDFS path. The requirement is to recursively search for unprocessed files and move the pending ones.
Path 1 -> /data/nifi/working/2019/10/source_2019_10_15.json --- Daily processed files are merged under this path and get added on a daily basis.
Path 2 -> /data/nifi/incoming/ -- The flow should create the folders if they don't exist and then move the files, or just move the files if the folders are already present.
Currently I am using the NiFi flow ListHDFS -> MoveHDFS, but I am unable to achieve this.
I need help with how this can be achieved.
Thank you for the help.
The following flow worked fine:
ListHDFS -> FetchHDFS -> UpdateAttribute -> PutHDFS
In ListHDFS, set the minimum file age to wait before consuming; this lets the flow pick up files recursively. Then use UpdateAttribute to build the destination folder, so that PutHDFS can re-create the folder and write the file into /data/nifi/incoming/.
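A rough configuration sketch, assuming the standard processor properties and using target.dir purely as an illustrative attribute name:

ListHDFS: Directory = /data/nifi/working, Recurse Subdirectories = true, Minimum File Age = <wait time before pickup>
UpdateAttribute: add a property target.dir = /data/nifi/incoming/${path}
PutHDFS: Directory = ${target.dir}

PutHDFS will normally create the target directory if it does not already exist, which covers the create-or-move requirement.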

Can I combine data present in multiple files stored in different folders using Power Query?

I want to combine data which is present in different files which are in turn stored in different folders. I've tried all I can but it seems I can only combine data from files present in a single folder. Is there any other way through which I can accomplish this?
You may create two queries (one for each folder) and then combine them:
= Folder.Files("C:\folder1") & Folder.Files("C:\folder2")
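If there are more than two folders, Table.Combine over a list of folder queries does the same thing (the folder paths here are placeholders):

= Table.Combine({
    Folder.Files("C:\folder1"),
    Folder.Files("C:\folder2"),
    Folder.Files("C:\folder3")
})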

How to merge HDFS small files into one large file?

I have a number of small files generated from a Kafka stream, and I would like to merge these small files into one single file. However, the merge is based on date: the original folder may contain a number of older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Iterate over the small files and append each one's contents to the target file
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode("append").text(target))
This example assumes a text file format, but you can just as well read any Spark-supported format, and you can use different formats for source and target as well.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
If you're working between HDFS and S3, this may also be helpful. You might actually even use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
s3-dist-cp also has a specific --groupBy option to aggregate multiple files in HDFS based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern.
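As a hedged command sketch (the paths, date and target size are placeholders; the capturing group in --groupBy names the merged output file):

s3-dist-cp --src hdfs:///data/small-files/ --dest hdfs:///data/merged/ --groupBy '.*(2019-10-15).*' --targetSize=1024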
You can also develop a Spark application that reads the data from the small files into a dataframe and writes that dataframe to the big file in append mode.
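A minimal sketch of that idea in Scala, assuming the small files sit under a date-named HDFS directory (all paths below are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-small-files").getOrCreate()

// Read only the small files for the given date (the directory layout is an assumption)
val df = spark.read.text("hdfs:///data/small-files/2019-10-15/*")

// Write everything out as a single file, appending to whatever is already there
df.coalesce(1).write.mode("append").text("hdfs:///data/merged/2019-10-15/")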

I have a CSV file with locations; I need to move S3 files to new locations

I am interested in loading my data into an AWS Athena DB.
My data is compartmentalized by source_video, and under each one we have 11 CSV files that represent 11 tables referencing this data.
Athena wants to load by table and not by source_video, so I have to move these files into folders based on table name rather than source_video.
I am fluent in Python and bash, and I know how to use the AWS CLI.
I wish to know if there is an easier way than running 4 million+ mv commands and executing them in different processes in parallel on several machines.
I have a CSV file that holds the locations of the files, which are stored as children of the source_video they were created for:
I have 400,000+ source_video locations
I have 11 files in each source_video location
i.e.
+source_video1
- 11 files by type
+source_video2
- 11 files by type
+source_video3
- 11 files by type
.
.
+source_video400,000+
- 11 files by type
I wish to move them into 11 folders (one per file type), with 400,000+ files in each folder.
The CSV fields are: videoName, CClocation, identityLocation, TAGTAskslocation, M2Location
and other locations ....
Below is an example of 2 rows of data:
pj1/09/11/09/S1/S1_IBM2MP_0353_00070280_DVR1.avi,
S1_IBM2MP_0353_00070280_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2extendeddata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCfeat.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentitiestaggers.csv
pj1/09/11/09/S1/S1_IBM2MP_0443_00070380_DVR1.avi,
S1_IBM2MP_0443_00070380_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2extendeddata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCfeat.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentitiestaggers.csv
You are correct. Athena expects all files related to one table to be located in one directory, or in subdirectories of one directory.
Given that you are going to touch so many files, you could choose to process the files rather than simply moving them, for example by putting the contents of several files into a smaller number of larger files. You could also consider compressing the files (eg gzipping them), because this would cost you less to scan (Athena is charged based upon the data read from disk; compressed files read less data and therefore cost less).
See: Analyzing Data in S3 using Amazon Athena
This type of processing could be done efficiently on an Amazon EMR cluster running Hadoop, but some specialist knowledge is required to run Hadoop, so it might be easier to use the tooling with which you are already familiar (eg Python).
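Since you are comfortable with Python, here is a rough boto3 sketch of the plain copy/move approach driven by your CSV (the bucket handling, CSV column indexes and per-table prefixes are assumptions for illustration):

import csv
import boto3

s3 = boto3.client("s3")

# Hypothetical mapping from CSV column index to the per-table target prefix
TABLE_PREFIXES = {2: "CCsidentities/", 3: "CCsTAGtasks/"}  # extend for all 11 tables

with open("locations.csv", newline="") as f:
    for row in csv.reader(f):
        for col, prefix in TABLE_PREFIXES.items():
            src = row[col].strip()
            # strip the s3 scheme (the example rows use s3:/bucket/...), then split bucket/key
            bucket, key = src.split("s3:", 1)[1].lstrip("/").split("/", 1)
            filename = key.rsplit("/", 1)[-1]  # keep just the file name under the new prefix
            s3.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": key},
                Key=prefix + filename,
            )
            # s3.delete_object(Bucket=bucket, Key=key)  # uncomment to turn the copy into a move

For 4 million+ objects you would run something like this in parallel, or build a manifest and let S3 Batch Operations perform the copies for you.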
