I am here to ask you a hypothetical question.
Part of my current job consists of creating and updating dashboards; most of them have to be updated every day.
I've created a Power BI dashboard from data linked to a folder of CSV files, with a few queries applied to tidy things up. So, every day, I download a CSV file from a client's web application and add it to the linked folder; everything gets updated automatically and all the queries are applied.
Hypothetical scenario: my client changes the CSV structure (e.g. column order, a few column names). How can I deal with this so that my merged CSV table stays up to date?
My guess would be to put the files with the new structure in a different folder, apply new queries so the table structures match, then append queries so I have a single table of data.
Is there a better way?
Thanks in advance.
Say I have some CSVs (all in the same folder) that I need to append/combine into a single Excel table, but:
the column order varies in some CSVs,
and the headers in some CSVs are different (for whatever reason) and need changing/renaming.
First CSV:
a,c,e,d,b
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
Second CSV:
ALPHA,b,c,d,e
4,4,4,4,4
5,5,5,5,5
6,6,6,6,6
Third CSV:
a,b,charlie,d,e
7,7,7,7,7
8,8,8,8,8
9,9,9,9,9
10,10,10,10,10
If the parent folder (containing my CSVs) is at "C:\Users\user\Desktop\changing csvs\csvs", then this M code should help me achieve what I need:
let
    renameMap = [ALPHA = "a", charlie = "c"],
    filesInFolder = Folder.Files("C:\Users\user\Desktop\changing csvs\csvs"),
    binaryToCSV = Table.AddColumn(filesInFolder, "CSVs", each
        let
            csv = Csv.Document([Content], [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
            promoteHeaders = Table.PromoteHeaders(csv, [PromoteAllScalars = true]),
            headers = Table.ColumnNames(promoteHeaders),
            newHeaders = List.Transform(headers, each Record.FieldOrDefault(renameMap, _, _)),
            renameHeaders = Table.RenameColumns(promoteHeaders, List.Zip({headers, newHeaders}))
        in
            renameHeaders
    ),
    append = Table.Combine(binaryToCSV[CSVs])
in
    append
You'd need to change the folder path in the code to whatever it is on your system.
Regarding the line renameMap = [ALPHA = "a", charlie = "c"], I needed to change "ALPHA" to "a" and "charlie" to "c" in my case, but you'd replace those with whatever columns need renaming in yours. (Add however many headers you need to rename.)
The line append = Table.Combine(binaryToCSV[CSVs]) appends the tables to one another (to give you one table). It should automatically handle differences in column order. If there are any rogue columns (e.g. say there was a column f in one of my CSVs that I didn't notice), then the final table will contain a column f, albeit with some nulls/blanks -- which is why it's important that all renaming is done before that line.
Once combined, you can obviously do whatever else needs doing to the table.
Try it to see if it works in your case.
Related
I have a workbook with multiple pages that need to get combined, i.e. stacked, into one table. While they have many similar column names, they do not all have the same columns, and the column order differs. Because of this I cannot use the inherent merge functionality, since it relies on column order. Table.Combine will solve the problem, but I cannot figure out how to create a statement that will use the "each" mechanic to do that.
For each worksheet in x workbook
Table.Combine(prior sheet, next sheet)
return all sheets stacked.
Would someone please help?
If you load your workbook with Excel.Workbook you can choose the Sheet Kind (instead of Table or DefinedName kinds) and ignore the sheet names.
let
    Source = Excel.Workbook(File.Contents("C:\Path\To\File\FileName.xlsx"), null, true),
    #"Filter Sheets" = Table.SelectRows(Source, each [Kind] = "Sheet"),
    #"Promote Headers" = Table.TransformColumns(#"Filter Sheets", {{"Data", each Table.PromoteHeaders(_, [PromoteAllScalars=true])}}),
    #"Combine Sheets" = Table.Combine(#"Promote Headers"[Data])
in
    #"Combine Sheets"
Load each table into Power Query as a separate query
fix up the column names as needed for each individual query
save each query as a connection
In one of the queries (or in a separate query), use the Append command to append all the fixed-up queries that now have the same column names.
I have several hundred parquet files created with PyArrow. Some of those files, however, have a field/column with a slightly different name (we'll call it Orange) than the original column (call it Sporange), because they were generated with a variant of the query. Otherwise the files are identical: all the other fields and all the data match. In a database world, I'd do an ALTER TABLE and rename the column, but I don't know how to do that with parquet/PyArrow.
Is there a way to rename the column in the file, rather than having to regenerate or duplicate the file?
Alternatively, can I read it (read_table or ParquetFile, I assume), change the column in the object (unsure how to do that) and write it out?
I see "rename_columns", but unsure how that works; I tried just using it by itself, it says "rename_columns is not defined".
rename_columns(self, names) Create new table with columns renamed to provided names.
Many thanks!
I suspect you are using a version of pyarrow that doesn't support rename_columns. Can you run pa.__version__ to check?
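For example:
import pyarrow as pa
print(pa.__version__)  # older pyarrow releases don't have Table.rename_columns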
Otherwise what you want to do is straightforward, in the example below I rename column b to c:
import pyarrow as pa
import pyarrow.parquet as pq

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())

table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([
        pa.field('a', col_a.type),
        pa.field('b', col_b.type),
    ])
)

pq.write_table(table, '/tmp/original')

original = pq.read_table('/tmp/original')
renamed = original.rename_columns(['a', 'c'])
pq.write_table(renamed, '/tmp/renamed')
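Since you have several hundred files, the same pattern extends to a loop. Here's a minimal sketch, assuming the files sit in one local directory and the stray column Orange should be normalized to Sporange; the paths are hypothetical, and since a parquet file can't be edited in place, each table is simply re-written:
import pathlib
import pyarrow.parquet as pq

src = pathlib.Path('/data/parquet')        # hypothetical input directory
dst = pathlib.Path('/data/parquet_fixed')  # hypothetical output directory
dst.mkdir(exist_ok=True)

for path in src.glob('*.parquet'):
    table = pq.read_table(path)
    if 'Orange' in table.column_names:
        # rename_columns takes the complete list of names, so patch just the stray one
        names = ['Sporange' if n == 'Orange' else n for n in table.column_names]
        table = table.rename_columns(names)
    pq.write_table(table, str(dst / path.name))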
I am working on a problem where I need to load a large number of CSVs and do some aggregations on them with SparkR.
I need to infer the schema whenever I can (so detect integers etc).
I need to assume that I can't hard-code the schema (unknown number of columns in each file or can't infer schema from column name alone).
I can't infer the schema from a CSV file with a duplicated header value - it simply won't let you.
I load them like so:
df1 <- read.df(sqlContext, file, "com.databricks.spark.csv", header = "true", delimiter = ",")
It loads OK, but when I try to run any sort of job (even a simple count()) it fails:
java.lang.IllegalArgumentException: The header contains a duplicate entry: # etc
I tried renaming the headers in the schema with:
new <- make.unique(c(names(df1)), sep = "_")
names(df1) <- new
schema(df1) # new column names present in schema
But when I try count() again, I get the same duplicate error as before, which suggests it refers back to the old column names.
I feel like there is a really easy way, apologies in advance if there is. Any suggestions?
The spark-csv package doesn't currently seem to have a way to skip lines by index, and if you don't use header="true", your header with dupes will become the first line, which will mess with your schema inference. If you happen to know what character your header with dupes starts with, and know that no other line will start with that character, you can put it in the comment character setting and that line will get skipped. E.g.:
df <- read.df(sqlContext, "cars.csv", "com.databricks.spark.csv", header = "false", comment = "y", delimiter = ",", nullValue = "NA", mode = "DROPMALFORMED", inferSchema = "true")
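Alternatively, if preprocessing the files outside Spark is an option, you could rewrite each header line with unique names first (the same idea as R's make.unique) and then load with header="true" and inferSchema="true" as usual. A minimal Python sketch, assuming plain local CSVs; the cars_fixed.csv output name is just for illustration:
import csv

def make_unique(names, sep='_'):
    # mimic R's make.unique: append _1, _2, ... to repeated names
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f'{n}{sep}{seen[n]}')
        else:
            seen[n] = 0
            out.append(n)
    return out

with open('cars.csv', newline='') as src, open('cars_fixed.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(make_unique(next(reader)))  # dedupe the header row only
    for row in reader:
        writer.writerow(row)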
I currently have the following pig script (column list truncated for brevity):
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
    SITE_ID_COL:int,             -- = Item Site ID
    META_ID_COL:int,             -- = Top Level (meta) category ID
    EXTRACT_DATE_COL:chararray,  -- = Date for the data points
    ...
);
SPLIT inputData INTO site0 IF (SITE_ID_COL == 0), site3 IF (SITE_ID_COL == 3), site15 IF (SITE_ID_COL == 15);
STORE site0 INTO 'pigsplit1/0/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/0/','2', 'bz2', '\\x7F');
STORE site3 INTO 'pigsplit1/3/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/3/','2', 'bz2', '\\x7F');
STORE site15 INTO 'pigsplit1/15/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/15/','2', 'bz2', '\\x7F');
And it works great for what I wanted it to do, but there are actually at least 22 possible site IDs and I'm not certain there aren't more. I'd like to dynamically create the splits and store to paths based on that column. Is the easiest way to do this a two-step usage of the MultiStorage UDF, first splitting by site ID and then loading each of those results and splitting by date? That seems inefficient. Can I somehow do it through GROUP BYs? It seems like I should be able to GROUP BY the site ID, then flatten each row and run MultiStorage on that, but I'm not sure how to concatenate the GROUP key into the path.
The MultiStorage UDF is not set up to divide inputs on two different fields, but that's essentially what you're doing -- the use of SPLIT is just to emulate MultiStorage with two parameters. In that case, I'd recommend the following:
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
    SITE_ID_COL:int,             -- = Item Site ID
    META_ID_COL:int,             -- = Top Level (meta) category ID
    EXTRACT_DATE_COL:chararray,  -- = Date for the data points
    ...
);
dataWithKey = FOREACH inputData GENERATE CONCAT(CONCAT((chararray)SITE_ID_COL, '-'), EXTRACT_DATE_COL), *;
STORE dataWithKey INTO 'tmpDir' USING org.apache.pig.piggybank.storage.MultiStorage('tmpDir', '0', 'bz2', '\\x7F');
Then go over your output with a simple script to list all the files in your output directories, extract the site and date IDs, and move them to appropriate locations with whatever structure you like.
Not the most elegant workaround, but it could work all right for you. One thing to watch out for: the separator you choose for your key might not be allowed (it may have to be alphanumeric). Also, you'll be stuck with that extra key field in your output data.
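For the follow-up move, something along these lines could do it. This is a minimal sketch, assuming the MultiStorage output has been pulled down to the local filesystem and that each key produced one directory named SITEID-DATE directly under tmpDir (the exact layout can vary by version; on HDFS you'd script hdfs dfs -mv instead):
import os
import shutil

src_root = 'tmpDir'     # MultiStorage output from the script above
dst_root = 'pigsplit1'  # target layout: pigsplit1/<site>/<date>/

for name in os.listdir(src_root):
    # split the key on the first '-'; the site ID itself can't contain the separator
    site, _, date = name.partition('-')
    dst = os.path.join(dst_root, site, date)
    os.makedirs(dst, exist_ok=True)
    shutil.move(os.path.join(src_root, name), dst)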
I've actually submitted a patch to the MultiStorage module to allow splitting on multiple tuple fields rather than only one, resulting in a dynamic output tree.
https://issues.apache.org/jira/browse/PIG-3258
It hasn't gotten much attention yet, but I'm using it in production with no issues.
I have two different pipe-delimited data files. One is larger than the other. I'm trying to selectively remove data from the large file (we'll call it file A), based on the data contained in the small file (file B). File A contains all of the data, and file B contains only a portion of the data from file A.
I want a function or existing program that removes all of the data contained within file B from file A. I had in mind a function like this:
Pseudo-code:
while !eof(fileB) {
    criteria = readLine(fileB);
    lineToRemove = searchForLine(criteria, fileA);
    deleteLine(lineToRemove, fileA);
}
However, that solution seems very inefficient to me. File A has 23,000 lines in it, and file B has 17,000. And the data contained within file B is literally scattered throughout file A.
If there is a program that can do this, I'd prefer it over code. I'm not picky about the code either. C++ is my strong language, but this data file is going to get converted into a SQL database in the near future so I'm good with SQL/PHP code as well.
Load the two tables into SQL, whatever the database. Doing this sort of manipulation is what databases are designed for. Then you can execute the command:
delete from A
where A.criteria in (select B.criteria from B)
However, I would put the data into staging tables first, and then create and populate the table I want in SQL. Something like:
create table A ( . . . )

insert into A
select *
from StagingA
where StagingA.criteria not in (select criteria from StagingB)
(Here I've used "*" and an insert without a column list. In practice, you should have the list of columns.)
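That said, if getting the data into SQL isn't convenient yet, the same not-in logic is only a few lines of script. A minimal Python sketch, assuming lines match exactly between the two files (the file names are placeholders):
# build a set of the ~17,000 lines to remove, then stream file A through once
with open('fileB.txt') as fb:
    to_remove = set(fb)

with open('fileA.txt') as fa, open('fileA_filtered.txt', 'w') as out:
    out.writelines(line for line in fa if line not in to_remove)
Because set membership is a constant-time lookup, this is one pass over each file instead of rescanning file A for every line of file B.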