Merge CSVs using Python (or Bash) - bash

I have a set of CSV files in a folder and I'd like to merge them into one "super-CSV". Some of the columns are present in all files, some are not.
A field in the output should simply be empty if it was not available in the source. If a column name is the same across multiple CSVs, its values should go into the existing column (Name in the example).
File1.CSV
ID  Name   ContactNo
53  Vikas  9874563210
File2.CSV
ID  Name     Designation
23  MyShore  Software Engineer
Output Expected
ID  Name     ContactNo   Designation
53  Vikas    9874563210
23  MyShore              Software Engineer
I've already tried other solutions, but they cannot handle empty fields, e.g. "merge csv files with different column order remove duplicates".
Thanks in advance
Michael

In Python, you can use the pandas module, which lets you load a CSV into a DataFrame, merge DataFrames, and then save the merged DataFrame to a new CSV file.
For example:
import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv is the replacement
df1 = pd.read_csv("file1.csv", sep=",")
df2 = pd.read_csv("file2.csv", sep=",")
# outer merge on the shared columns (ID, Name) keeps the rows of both files
final_df = df1.merge(df2, how="outer")
final_df.to_csv("result.csv", sep=",", index=False)
which would produce (row order may differ between pandas versions)
ID,Name,ContactNo,Designation
53,Vikas,9874563210.0,
23,MyShore,,Software Engineer
You may need to adjust the sep argument to match the format of your files.
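For a whole folder of files, the same idea can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: the function name merge_csv_folder is hypothetical, the files are assumed to be comma-separated with a header row, and the "*.csv" pattern is case-sensitive on some systems, so it would need adjusting for names like File1.CSV.

```python
import csv
import glob
import os


def merge_csv_folder(folder, out_path):
    """Merge every *.csv in `folder` into one file holding the union of all columns."""
    out_abs = os.path.abspath(out_path)
    # Skip the output file itself in case it lives in the same folder.
    paths = sorted(
        p for p in glob.glob(os.path.join(folder, "*.csv"))
        if os.path.abspath(p) != out_abs
    )

    header = []  # union of column names, in first-seen order
    rows = []    # all rows are held in memory; fine for small inputs
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in header:
                    header.append(name)
            rows.extend(reader)

    with open(out_path, "w", newline="") as f:
        # restval="" leaves a field empty when the source file lacked that column
        writer = csv.DictWriter(f, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return header
```

Calling merge_csv_folder("my_folder", "result.csv") would write the merged file and return the combined header. Columns with the same name end up in the same output column, and missing values stay empty rather than becoming NaN or 9874563210.0.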

Related

Joining three csv files with different header in NiFi

I am new to NiFi. Currently I have a use case where I have to merge three CSV files that I extracted from XLSX files. I need to merge them on the NPI column, which is present in all three CSV files.
files1.xlsx looks like-
NPI         FULL_NAME       LAST_NAME  FIRST_NAME
1003002627  Arora Himanshu  Arora      Himanshu
1003007204  Arora Vishal    Arora      Vishal
files2.xlsx looks like-
NPI No      Employee Number  CHI/SL Acct-Unit
1003002627  147536           5812207
1003007204  185793           5854207
files3.xlsx looks like -
Individual NPI  Group NPI   Market
1003002627      1396935714  Houston
1003007204      1396935714  Houston
I want to left join my first CSV with specific columns from the other CSV files so that my desired output is
NPI         Full Name       Employee Number  Market
1003002627  Arora Himanshu  147536           Houston
1003007204  Arora Vishal    185793           Houston
I tried the QueryRecord processor, but I don't know how to combine three different schemas from three different CSV files. Any help on this will be highly appreciated.
This is what I have tried.
NiFi doesn't have the ability to do what amounts to a SQL join on three different record sets like this out of the box, but there is a workaround if you know Python:
Zip them up.
Ingest the zip file.
Use PutFile to put the zip to a standard location.
Write a Python script that will unzip the zip for you and do the merging.
Call the Python script immediately after PutFile with the ExecuteStreamCommand processor.
The ExecuteStreamCommand processor will read stdout, so just write your combined output with sys.stdout and it'll get back into NiFi.
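The merging script from the steps above might look roughly like the sketch below. It is an illustration under assumptions, not a tested NiFi flow: the member names files1.csv, files2.csv and files3.csv inside the zip are hypothetical, the column headers are taken from the samples in the question, and the zip path is expected as the first command-line argument.

```python
import csv
import io
import sys
import zipfile


def read_csv_from_zip(zf, member):
    """Return the rows of one CSV inside an open ZipFile as a list of dicts."""
    with zf.open(member) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


def left_join(names, employees, markets):
    """Left join on NPI: keep every row of `names`, look up the other two by key."""
    emp_by_npi = {r["NPI No"]: r for r in employees}
    mkt_by_npi = {r["Individual NPI"]: r for r in markets}
    joined = []
    for r in names:
        npi = r["NPI"]
        joined.append({
            "NPI": npi,
            "Full Name": r["FULL_NAME"],
            # Missing matches become empty fields, as in a SQL left join.
            "Employee Number": emp_by_npi.get(npi, {}).get("Employee Number", ""),
            "Market": mkt_by_npi.get(npi, {}).get("Market", ""),
        })
    return joined


def main(zip_path):
    with zipfile.ZipFile(zip_path) as zf:
        # Member names are assumptions; use whatever the archive really contains.
        names = read_csv_from_zip(zf, "files1.csv")
        employees = read_csv_from_zip(zf, "files2.csv")
        markets = read_csv_from_zip(zf, "files3.csv")
    fields = ["NPI", "Full Name", "Employee Number", "Market"]
    # Writing to stdout is what hands the result back to ExecuteStreamCommand.
    writer = csv.DictWriter(sys.stdout, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(left_join(names, employees, markets))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

If an NPI appears more than once in the second or third file, this sketch keeps only the last occurrence; a real join would need a rule for such duplicates.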

How to extract multiple values as multiple column data from filename by Informatica PowerCenter?

I am very new to Informatica PowerCenter and have just started learning it, so I'm looking for help. My requirement is this: I have to extract data from a flat file (CSV file) and store the data in an Oracle table. Some of the column values of the target table should come from the file name.
For example:
My Target Table is like below:
USER_ID Program_Code Program_Desc Visit Date Term
EACRP00127 ER Special Visits 08/02/2015 Aug 2015
My input filename is: Aug 2015 ER Special Visits EACRP00127.csv
From this FileName I have to extract "AUG 2015" as Term, "ER Special Visits" as Program_Desc and "EACRP00127" as Program_Code along with some other fields from the CSV file.
I have found one solution using "Currently Processed Filename", but with it I am only able to get one single value from the file name. How can I extract three values from the file name and store them in the target table? Any pointers towards a solution would be appreciated. Thank you.
Using an Expression transformation, you can create three output values from the Currently Processed Filename column.
You get the file name from the Source Qualifier (SQ) through the 'Currently Processed Filename' field, and then take substrings of the whole string to get what you want. SUBSTR takes the string, a start position, and a length, so for the example file name:
input/output = Currently Processed Filename
o_Term = substr(Currently Processed Filename,1,8)
o_Program_Desc = substr(Currently Processed Filename,10,17)
o_Program_Code = substr(Currently Processed Filename,28,10)
These fixed positions only work if every file name has exactly the same layout as the example.
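When the description part varies in length, splitting on spaces is more robust than fixed offsets. The sketch below shows that logic in Python purely as an illustration; it is not PowerCenter expression syntax, the function name parse_filename is hypothetical, and it assumes the layout "<month> <year> <description words> <code>.csv" from the example.

```python
import os


def parse_filename(path):
    """Split '<Mon> <YYYY> <description words...> <code>.csv' into its three parts."""
    stem = os.path.splitext(os.path.basename(path))[0]  # drop directory and extension
    tokens = stem.split()
    if len(tokens) < 4:
        raise ValueError(f"unexpected file name layout: {path!r}")
    term = " ".join(tokens[:2])    # first two words, e.g. "Aug 2015"
    code = tokens[-1]              # last word, e.g. "EACRP00127"
    desc = " ".join(tokens[2:-1])  # everything in between
    return term, desc, code


term, desc, code = parse_filename("Aug 2015 ER Special Visits EACRP00127.csv")
```

For the example file name this yields "Aug 2015", "ER Special Visits" and "EACRP00127". In PowerCenter the same idea would have to be expressed with its own string functions inside the Expression transformation.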

How can I change the name of a column in a parquet file using Pyarrow?

I have several hundred parquet files created with PyArrow. Some of those files, however, have a field/column with a slightly different name (we'll call it Orange) than the original column (call it Sporange), because one used a variant of the query. Otherwise, the data (all the other fields, and all the data) is identical. In a database world, I'd do an ALTER TABLE and rename the column. However, I don't know how to do that with parquet/PyArrow
Is there a way to rename the column in the file, rather than having to regenerate or duplicate the file?
Alternatively, can I read it (read_table or ParquetFile, I assume), change the column in the object (unsure how to do that), and write it out?
I see "rename_columns", but I'm unsure how it works; when I tried using it by itself, it said "rename_columns is not defined".
rename_columns(self, names) Create new table with columns renamed to provided names.
Many thanks!
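I suspect you are using a version of pyarrow that doesn't support rename_columns. Can you run pa.__version__ to check? Also note that rename_columns is a method of Table, so it has to be called on a table object (table.rename_columns(...)); calling it as a bare function gives "rename_columns is not defined".
Otherwise what you want to do is straightforward. In the example below I rename column b to c: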
import pyarrow as pa
import pyarrow.parquet as pq
col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())
table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([
        pa.field('a', col_a.type),
        pa.field('b', col_b.type),
    ])
)
pq.write_table(table, '/tmp/original')
original = pq.read_table('/tmp/original')
renamed = original.rename_columns(['a', 'c'])
pq.write_table(renamed, '/tmp/renamed')
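Since rename_columns expects the full list of names, renaming a single column across many files is easier with a small mapping helper. A minimal sketch, assuming the helper names renamed_names and rename_parquet_columns (both hypothetical) and that each file fits in memory; the file is still read and written out again, as the example above does.

```python
def renamed_names(names, mapping):
    """Return `names` with any name found in `mapping` replaced; order is preserved."""
    return [mapping.get(n, n) for n in names]


def rename_parquet_columns(src, dst, mapping):
    """Read `src`, rename columns according to `mapping`, and write the result to `dst`."""
    import pyarrow.parquet as pq  # imported here so the helper above has no dependencies

    table = pq.read_table(src)
    table = table.rename_columns(renamed_names(table.column_names, mapping))
    pq.write_table(table, dst)
```

For the files in the question, the call would be something like rename_parquet_columns(path, new_path, {"Orange": "Sporange"}), looped over the affected files. Files that do not contain the old name pass through with their columns unchanged.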

Combine csv files with different structures over time

I am here to ask you a hypothetical question.
Part of my current job consists of creating and updating dashboards. Most dashboards have to be updated every day.
I've created a Power BI dashboard from data linked to a folder filled with CSV files, and I wrote some queries to edit a few things. So, every day, I download a CSV file from a client's web application and add it to the linked folder; everything gets updated automatically and all the queries are applied.
Hypothetical scenario: my client changes the CSV structure (e.g. column order, a few column names). How can I deal with this so that my table of merged CSV files stays up to date?
My guess would be to put the files with the new structure in a different folder, apply new queries so the table structures match, then append queries so I have a single table of data.
Is there a better way?
Thanks in advance.
Say I have some CSVs (all in the same folder) that I need to append/combine into a single Excel table, but:
the column order varies in some CSVs,
and the headers in some CSVs are different (for whatever reason) and need changing/renaming.
First CSV:
a,c,e,d,b
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
Second CSV:
ALPHA,b,c,d,e
4,4,4,4,4
5,5,5,5,5
6,6,6,6,6
Third CSV:
a,b,charlie,d,e
7,7,7,7,7
8,8,8,8,8
9,9,9,9,9
10,10,10,10,10
If the parent folder (containing my CSVs) is at "C:\Users\user\Desktop\changing csvs\csvs", then this M code should help me achieve what I need:
let
    renameMap = [ALPHA = "a", charlie = "c"],
    filesInFolder = Folder.Files("C:\Users\user\Desktop\changing csvs\csvs"),
    binaryToCSV = Table.AddColumn(filesInFolder, "CSVs", each
        let
            csv = Csv.Document([Content], [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
            promoteHeaders = Table.PromoteHeaders(csv, [PromoteAllScalars = true]),
            headers = Table.ColumnNames(promoteHeaders),
            newHeaders = List.Transform(headers, each Record.FieldOrDefault(renameMap, _, _)),
            renameHeaders = Table.RenameColumns(promoteHeaders, List.Zip({headers, newHeaders}))
        in
            renameHeaders
    ),
    append = Table.Combine(binaryToCSV[CSVs])
in
    append
You'd need to change the folder path in the code to whatever it is on your system.
Regarding the line renameMap = [ALPHA = "a", charlie = "c"], I needed to change "ALPHA" to "a" and "charlie" to "c" in my case, but you'd need to replace these with whatever columns need renaming in your case. (Add however many headers you need to rename.)
The line append = Table.Combine(binaryToCSV[CSVs]) will append the tables to one another (to give you one table). It should automatically handle differences in column order. If there are any rogue columns (e.g. say there was a column f in one of my CSVs that I didn't notice), then my final table will contain a column f, albeit with some nulls/blanks, which is why it's important that all renaming has been done before that line.
Once combined, you can obviously do whatever else needs doing to the table.
Try it to see if it works in your case.

Python/Pandas - merging one to many csv for denormalization

I have a bunch of large csv files that were extracted out of a relational database. So for example I have customers.csv , address.csv and customer-address.csv that maps the key values for the relationships. I found an answer on how to merge the files here :
Python/Panda - merge csv according to join table/csv
So right now my code looks like this:
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
df = (df3.merge(df1, left_on='CID', right_on='ID')
         .merge(df2, left_on='AID', right_on='ID', suffixes=('', '_'))
         .drop(['CID', 'AID', 'ID_'], axis=1))
print(df)
Now I noticed that I have files with a one to many relationship and with the code above pandas is probably overriding values when there are multiple matches for one key.
Is there a method to join files with a one to many (many to many) relationship? I'm thinking of creating a full (redundant) row for each foreign key. So basically denormalization.
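The answer to my question is to perform an outer join. With the code below, pandas creates a new row for every occurrence of an ID in the left or right DataFrame, thus creating a denormalized table. Note that merge does not overwrite values when a key matches several rows: every join type returns one row per matching pair, and how='outer' additionally keeps keys that appear on only one side.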
df1.merge(df2, left_on='CID', right_on='ID', how='outer')
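The row-multiplying behavior is easy to check on toy data. The sketch below uses made-up frames (customers and links are hypothetical stand-ins for customers.csv and customer-address.csv) in which customer 1 is linked to two addresses and customer 2 to none.

```python
import pandas as pd

# Hypothetical sample data: customer 1 has two address links, customer 2 has none.
customers = pd.DataFrame({"ID": [1, 2], "Name": ["Ann", "Bob"]})
links = pd.DataFrame({"CID": [1, 1], "AID": [10, 11]})

# Default (inner) join: one output row per matching pair, so Ann appears twice.
inner = links.merge(customers, left_on="CID", right_on="ID")

# Outer join: the same matched rows, plus Bob, who has no link (AID is NaN).
outer = links.merge(customers, left_on="CID", right_on="ID", how="outer")

print(inner)
print(outer)
```

The inner result has two rows, both for Ann, and the outer result has three, so the choice of how only decides whether unmatched keys are kept.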
