How can I change the name of a column in a parquet file using PyArrow?

I have several hundred parquet files created with PyArrow. In some of those files, however, one field/column has a slightly different name (we'll call it Orange) than in the original files (call it Sporange), because they were generated with a variant of the query. Otherwise the data (all the other fields, and all the rows) is identical. In a database world, I'd do an ALTER TABLE and rename the column. However, I don't know how to do that with parquet/PyArrow.
Is there a way to rename the column in the file, rather than having to regenerate or duplicate the file?
Alternatively, can I read it in (with read_table or ParquetFile, I assume), rename the column in the resulting object (unsure how to do that), and write it back out?
I see "rename_columns", but unsure how that works; I tried just using it by itself, it says "rename_columns is not defined".
rename_columns(self, names) Create new table with columns renamed to provided names.
Many thanks!

I suspect you are using a version of pyarrow that doesn't support rename_columns. Can you run pa.__version__ to check?
Otherwise, what you want to do is straightforward; in the example below I rename column b to c:
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example table with columns a and b.
col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())
table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([
        pa.field('a', col_a.type),
        pa.field('b', col_b.type),
    ])
)
pq.write_table(table, '/tmp/original')

# Read it back, rename column b to c, and write the renamed table out.
# rename_columns is a Table method and takes the full list of new names.
original = pq.read_table('/tmp/original')
renamed = original.rename_columns(['a', 'c'])
pq.write_table(renamed, '/tmp/renamed')
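Since the question mentions several hundred files, here is a minimal batch sketch along the same lines; it assumes the files sit in a single directory and that the stray column is literally named Sporange (adjust both for your data). As far as I know there is no metadata-only, in-place rename for parquet, so each affected file does get rewritten:
import pyarrow.parquet as pq
from pathlib import Path

# Hypothetical directory; point this at wherever your parquet files live.
for path in Path("/data/parquet").glob("*.parquet"):
    table = pq.read_table(path)  # reads the whole file into memory first
    if "Sporange" in table.column_names:
        # Rename only the stray column; leave every other name unchanged.
        new_names = ["Orange" if name == "Sporange" else name
                     for name in table.column_names]
        pq.write_table(table.rename_columns(new_names), path)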

Related

Combine Tables matching a name pattern in Power Query

I am trying to combine many tables whose names match a pattern.
So far, I have extracted the table names from #shared and have them in a list.
What I haven't been able to do is loop over this list and transform it into a list of tables that can be combined.
e.g. Name is the list with the table names:
Source = Table.Combine( { List.Transform(Name, each #shared[_] )} )
The error is:
Expression.Error: We cannot convert a value of type List to type Text.
Details:
Value=[List]
Type=[Type]
I have tried many ways but I am missing some kind of type transformation.
I was able to transform this list of table names into a list of tables with:
T1 = List.Transform(Name, each Expression.Evaluate(_, #shared))
However, Expression.Evaluate feels like an ugly hack. Is there a better way to do this transformation?
With this list of tables, I tried to combine them with:
Source = Table.Combine(T1)
But I got the error:
Expression.Error: A cyclic reference was encountered during evaluation.
If I extract a table from the list by its index (e.g. T1{2}) it works, so along this line of thinking I would need some kind of loop to append them.
Steps illustrating the problem (screenshots omitted). The objective is to append (Table.Combine) every table named T_\d+_Mov: filter the matching table names into a table, convert it to a list, convert the names in the list to the actual tables, and then combine them. That last step is where I am stuck.
It is important to note that I don't want to use VBA for this.
It would be easier to recreate the query from VBA by scanning ThisWorkbook.Queries(), but that would not be a clean reload when adding or removing tables.
The final solution, as suggested by @Michal Palko, was:
CT1 = Table.FromList(T1, Splitter.SplitByNothing(), {"Name"}, null, ExtraValues.Ignore),
EC1 = Table.ExpandTableColumn(CT1, "Name", Table.ColumnNames(CT1{0}[Name]))
where T1 was the previous step.
The only caveat is that the first table must have all the columns; columns missing from it will be skipped.
I think there might be an easier way, but given your approach, try converting your list to a table (a single column) and then expanding that column, as in the CT1/EC1 steps above.
Alternatively, use Table.Combine(YourList).
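Outside Power Query, the same match-names-then-append idea is easy to sketch in Python with pandas; the tables dict below is just a stand-in for #shared, with made-up names and data:
import re
import pandas as pd

# Hypothetical stand-in for #shared: a dict mapping query names to tables.
tables = {
    "T_1_Mov": pd.DataFrame({"a": [1, 2]}),
    "T_2_Mov": pd.DataFrame({"a": [3]}),
    "Other": pd.DataFrame({"a": [9]}),
}

# Keep only the tables whose names match the pattern, then append them.
pattern = re.compile(r"T_\d+_Mov")
combined = pd.concat(
    [df for name, df in tables.items() if pattern.fullmatch(name)],
    ignore_index=True,
)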

Informatica Reading From Metadata

I have a metadata name like CONTACTS(SOURCE.CSV|TARGET.CSV). I read this file with a reader and populate the values into a table I created as CONTACT_TABLE(PK NUMBER, SOURCE_NAME VARCHAR2(500), TARGET_NAME VARCHAR2(500)). After that, I want to read the source.csv and target.csv files referenced in CONTACT_TABLE and populate their values into another table called SOURCE_COLUMN_TARGET_COLUMN_TABLE(PK, FK referencing the PK of CONTACT_TABLE, SOURCE_COLUMN, TARGET_COLUMN). This table should contain all the columns of source and target, with a one-to-one relationship between them, for example source.csv(fn) ----- target.csv(firstName).
My objective is that whenever we add another attribute to the source or target, I should not have to change the entire mapping. For example, if we add source.csv(email) and target.csv(email), it should map directly.
Thanks!
Please help!
I need this task completed before Friday. I have searched everywhere and found material on dynamic mappings and parameters, but it was not very helpful; I want to do it this way.
It's not entirely clear what you are asking. The Source Analyzer uses the source files (.csv) at import time, so the Source Qualifier carries the same format.
So, if any values get added to your existing files (source.csv, target.csv), they effectively become new files for your existing mapping. Hence you don't need to change the whole mapping; you just need to import the file again.

Combine csv files with different structures over time

I am here to ask you a hypothetical question.
Part of my current job consists of creating and updating dashboards. Most dashboards have to be updated everyday.
I've created a PowerBI dashboard from data linked to a folder filled with csv files. I did some queries to edit some things. So, everyday, I download a csv file from a client's web application and add the said file to the linked folder, everything gets updated automatically and all the queries created are applied.
Hypothetical scenario: my client changes the csv structure (e.g. the column order, or a few column names). How can I deal with this so that my merged table of csv data stays up to date?
My guess would be to put the files with the new structure in a different folder, apply new queries so the table structures match, then append queries so I have a single table of data.
Is there a better way?
Thanks in advance.
Say I have some CSVs (all in the same folder) that I need to append/combine into a single Excel table, but:
the column order varies in some CSVs,
and the headers in some CSVs are different (for whatever reason) and need changing/renaming.
First CSV:
a,c,e,d,b
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
Second CSV:
ALPHA,b,c,d,e
4,4,4,4,4
5,5,5,5,5
6,6,6,6,6
Third CSV:
a,b,charlie,d,e
7,7,7,7,7
8,8,8,8,8
9,9,9,9,9
10,10,10,10,10
If the parent folder (containing my CSVs) is at "C:\Users\user\Desktop\changing csvs\csvs", then this M code should help me achieve what I need:
let
    // Map of old header names to new ones.
    renameMap = [ALPHA = "a", charlie = "c"],
    filesInFolder = Folder.Files("C:\Users\user\Desktop\changing csvs\csvs"),
    binaryToCSV = Table.AddColumn(filesInFolder, "CSVs", each
        let
            csv = Csv.Document([Content], [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
            promoteHeaders = Table.PromoteHeaders(csv, [PromoteAllScalars = true]),
            headers = Table.ColumnNames(promoteHeaders),
            // Replace any header found in renameMap; leave the rest unchanged.
            newHeaders = List.Transform(headers, each Record.FieldOrDefault(renameMap, _, _)),
            renameHeaders = Table.RenameColumns(promoteHeaders, List.Zip({headers, newHeaders}))
        in
            renameHeaders
    ),
    append = Table.Combine(binaryToCSV[CSVs])
in
    append
You'd need to change the folder path in the code to whatever it is on your system.
Regarding the line renameMap = [ALPHA = "a", charlie = "c"]: I needed to change ALPHA to a and charlie to c in my case, but you'd replace these with whatever columns need renaming in your case. (Add however many headers you need to rename.)
The line append = Table.Combine(binaryToCSV[CSVs]) appends the tables to one another (to give you one table). It should automatically handle differences in column order. If there are any rogue columns (e.g. say there was a column f in one of my CSVs that I didn't notice), then the final table will contain a column f, albeit with some nulls/blanks -- which is why it's important that all renaming is done before that line.
Once combined, you can obviously do whatever else needs doing to the table.
Try it to see if it works in your case.
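For comparison, a rough pandas sketch of the same approach; the folder path and rename map mirror the M code above and are just examples:
import glob
import pandas as pd

# Headers to normalize before combining (example values, as in the M code).
rename_map = {"ALPHA": "a", "charlie": "c"}

frames = [
    pd.read_csv(path).rename(columns=rename_map)
    for path in glob.glob(r"C:\Users\user\Desktop\changing csvs\csvs\*.csv")
]

# concat aligns columns by name regardless of their order in each file;
# columns missing from a file come through as NaN.
combined = pd.concat(frames, ignore_index=True)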

Working with duplicated columns in SparkR

I am working on a problem where I need to load a large number of CSVs and do some aggregations on them with SparkR.
I need to infer the schema whenever I can (to detect integers, etc.).
I need to assume that I can't hard-code the schema (the number of columns in each file is unknown, and the schema can't be inferred from the column names alone).
I can't infer the schema from a CSV file with a duplicated header value -- Spark simply won't let you.
I load them like so:
df1 <- read.df(sqlContext, file, "com.databricks.spark.csv", header = "true", delimiter = ",")
It loads OK, but when I try to run any sort of job (even a simple count()) it fails:
java.lang.IllegalArgumentException: The header contains a duplicate entry: # etc
I tried renaming the headers in the schema with:
new <- make.unique(c(names(df1)), sep = "_")
names(df1) <- new
schema(df1) # new column names present in schema
But when I try count() again, I get the same duplicate error as before, which suggests it refers back to the old column names.
I feel like there is a really easy way, apologies in advance if there is. Any suggestions?
The spark-csv package doesn't currently seem to have a way to skip lines by index, and if you don't use header="true", your header row with the duplicates becomes the first data line, which will mess with your schema inference. If you happen to know what character your duplicated header starts with, and know that no other line starts with that character, you can set it as the comment character and that line will be skipped, e.g.:
df <- read.df(sqlContext, "cars.csv", "com.databricks.spark.csv", header = "false", comment = "y", delimiter = ",", nullValue = "NA", mode = "DROPMALFORMED", inferSchema = "true")
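For reference, a roughly equivalent sketch in PySpark, under the same assumption that the duplicated header line starts with "y" and no data line does:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Treat lines starting with "y" as comments so the duplicated header is
# skipped, then infer the schema from the remaining data rows.
df = (spark.read
      .option("header", "false")
      .option("comment", "y")
      .option("nullValue", "NA")
      .option("mode", "DROPMALFORMED")
      .option("inferSchema", "true")
      .csv("cars.csv"))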

algorithm - how do I use data from one file as criteria for search/delete from a different file?

I have two different pipe-delimited data files. One is larger than the other. I'm trying to selectively remove data from the large file (we'll call it file A), based on the data contained in the small file (file B). File A contains all of the data, and file B contains only a portion of the data from file A.
I want a function or existing program that removes all of the data contained within file B from file A. I had in mind a function like this:
Pseudo-code:
while !eof(fileB) {
    criteria = readLine(fileB);
    lineToRemove = searchForLine(criteria, fileA);
    deleteLine(lineToRemove, fileA);
}
However, that solution seems very inefficient to me: it rescans file A for every line of file B. File A has 23,000 lines and file B has 17,000, and the data contained in file B is literally scattered throughout file A.
If there is a program that can do this, I'd prefer it over code. I'm not picky about the code either. C++ is my strong language, but this data file is going to get converted into a SQL database in the near future so I'm good with SQL/PHP code as well.
Load the two tables into SQL, whatever the database; this sort of manipulation is exactly what databases are designed for. Then you can execute the command:
delete from A
where A.criteria in (select B.criteria from B)
However, I would put the data into staging tables, and then create and populate the data that I want in SQL. Something like:
create table A ( . . . )
insert into A
select *
from StagingA
where StagingA.criteria not in (select B.criteria from StagingB)
(Here I've used "*" and an insert without a column list. In practice, you should have the list of columns.)
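If you'd rather not stand up a database just for this, a hash-set lookup gets the same result in a short script. Here is a minimal sketch (the file names are placeholders, and it assumes each line of file B matches a whole line of file A exactly):
# Read every line of file B into a set, then keep only the lines of
# file A that are absent from that set. At roughly 20,000 lines each,
# both files fit comfortably in memory, and each lookup is O(1).
with open("fileB.txt") as fb:
    to_remove = set(fb)

with open("fileA.txt") as fa, open("fileA_filtered.txt", "w") as out:
    for line in fa:
        if line not in to_remove:
            out.write(line)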
