Working with duplicated columns in SparkR

I am working on a problem where I need to load a large number of CSVs and do some aggregations on them with SparkR.
I need to infer the schema wherever I can (so that integers etc. are detected).
I have to assume that I can't hard-code the schema (the number of columns in each file is unknown, and the schema can't be inferred from the column names alone).
I can't infer the schema from a CSV file that has a duplicated header value - it simply won't let me.
I load them like so:
df1 <- read.df(sqlContext, file, "com.databricks.spark.csv", header = "true", delimiter = ",")
It loads OK, but when I try to run any sort of job (even a simple count()) it fails:
java.lang.IllegalArgumentException: The header contains a duplicate entry: # etc
I tried renaming the headers in the schema with:
new <- make.unique(c(names(df1)), sep = "_")
names(df1) <- new
schema(df1) # new column names present in schema
But when I try count() again, I get the same duplicate error as before, which suggests it refers back to the old column names.
I feel like there is a really easy way, apologies in advance if there is. Any suggestions?

The spark-csv package doesn't currently seem to have a way to skip lines by index, and if you don't use header="true", your header row with the duplicates becomes the first data line, which will throw off schema inference. If you happen to know what character your duplicated header starts with, and know that no other line starts with that character, you can set it as the comment character and that line will be skipped. E.g.:
df <- read.df(sqlContext, "cars.csv", "com.databricks.spark.csv", header = "false", comment = "y", delimiter = ",", nullValue = "NA", mode = "DROPMALFORMED", inferSchema = "true")
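If you still want the original header names afterwards, a minimal follow-up sketch (assuming the CSV is also readable from the driver's local filesystem, and that the names<- assignment works as shown in the question) is to grab the header line with base R, de-duplicate it with make.unique(), and assign it to the DataFrame that was read with header = "false":
# Read the raw header line locally; 'file' is the same CSV path used above
header_line <- readLines(file, n = 1)
# Naive split on "," - fine as long as the header contains no quoted commas
new_names <- make.unique(strsplit(header_line, ",")[[1]], sep = "_")
names(df) <- new_names   # rename the SparkR DataFrame columns
count(df)                # the job should now run without the duplicate-header error
Because the duplicated header line is never parsed as a header, the de-duplicated names are applied on the Spark side and count() no longer trips over the duplicate entry.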

Related

Data Factory special characters in column headers

I have a file that I am reading into a blob via Data Factory.
It's formatted in Excel. Some of the column headers have special characters and spaces, which isn't good if I want to take it to CSV or Parquet and then SQL.
Is there a way to correct this in the pipeline?
Example
"Activations in last 15 seconds high+Low" "first entry speed (serial T/a)"
Thanks
Normally, Data Flow can handle this for you by adding a Select transformation with a Rule:
Uncheck "Auto mapping".
Click "+ Add mapping"
For the column name, enter "true()" to process all columns.
Enter an appropriate expression to rename the columns. This example uses regular expressions to remove any character that is not a letter.
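A rough sketch of such an expression, assuming the Data Flow expression function regexReplace and the $$ placeholder for the matched column name (the exact rule isn't reproduced here), would be to put this in the "Name as" field:
regexReplace($$, '[^a-zA-Z]', '')
Applied to the example headers above, this strips the spaces, digits, and symbols such as "+" so the resulting names are safe for CSV, Parquet, and SQL.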
SPECIAL CASE
There may be an issue with this if the column name contains forward slashes ("/"). I accidentally came across this in my testing.
Every one of the columns not mapped contains forward slashes. Unfortunately, I cannot explain why this would be the case, as Data Flow is clearly aware of the column name. It can be addressed manually by adding a Fixed rule for EACH offending column, which is obviously less than ideal.
ANOTHER OPTION
The other thing you could try is to pre-process the text file with another Data Flow, using a Source dataset that has no delimiters. This would give you the contents of each row as a single column. If you could get a handle on just the first row, you could remove the special characters there.

Power Query – File names loaded from folder become column names, causing failure if new files are later loaded

Power Query sourcing multiple Excel files from a folder.
Files are monthly transactions. The month and year are part of the file names. When the next month comes, new files (in the same format of course, but with new file names) replace the previous ones in the source folder. Having the new file names causes the query to fail on refresh in the following way.
When the files are combined and displayed to begin the transformations, the file names constitute a column of data (named Source). One of my steps in transforming the data is to “use first row as headers”; at this point the first file name in that Source column becomes its column header name.
The problem is that when files having new names replace the previous ones, that column name is no longer found, since the row promoted to be the column header is the name of a new file. PQ is looking for a column header having the original file name and doesn’t find it, so subsequent transformations using that column cause errors.
The error message is: “[Expression.Error] The column ‘[OriginalFileName]’ of the table wasn’t found.”
Basically, that original file name takes on a permanent role as a column name that is part of the query.
I successfully managed to get around the problem by manually renaming all the columns instead of promoting the first data row to be the column headers. Now files with new names are processed without complaint. But this solution is clunky and I would like to keep the step of promoting the first row to be the header.
Does anyone know how to overcome this problem?
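One way to keep the promote-headers step while avoiding the dependence on a file name is to rename the promoted column by position rather than by its literal name, so later steps only ever see a stable name. A minimal M sketch (the #table is placeholder data standing in for the real combined-files step, and "SourceFile" is just an illustrative name):
let
    // Placeholder data standing in for the combined files: the first column holds the
    // file names (the "Source" column) and the first row holds the values that get
    // promoted to headers.
    CombinedFiles = #table(
        {"Column1", "Column2"},
        {{"Transactions Jan 2024.xlsx", "Amount"}, {"Transactions Jan 2024.xlsx", "100"}}
    ),
    Promoted = Table.PromoteHeaders(CombinedFiles, [PromoteAllScalars = true]),
    // Rename the first column -- whose promoted header is a file name -- by position,
    // so downstream steps reference "SourceFile" instead of the changing file name.
    Stabilized = Table.RenameColumns(
        Promoted,
        {{Table.ColumnNames(Promoted){0}, "SourceFile"}}
    )
in
    Stabilized
With this in place, next month's files can promote whatever header row they carry; the query never refers to the file-name column by the name of a particular file.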

Informatica Reading From Metadata

I have a metadata name, CONTACTS(SOURCE.CSV|TARGET.CSV). I read this file using a reader and populate the values into a table that I created as CONTACT_TABLE(PK NUMBER, Source_name varchar2(500), Target_name varchar2(500)). After that, I want to read the source.csv and target.csv files stored in my CONTACT_TABLE and populate the values into another table called SOURCE_COLUMN_TARGET_COLUMN_TABLE(PK, FK referencing the PK of CONTACT_TABLE, source_column, target_column). This table should contain all the columns of source and target and have a one-to-one relationship between them, for example source.csv(fn)-----target.csv(firstName).
My objective is that whenever we add another attribute to the source or target, I should not have to change the entire mapping; for example, if we add source.csv(email) and target.csv(email), it should map directly.
Thanks!
Please help!
I have to have this task completed before Friday. I have searched every source I could find on dynamic mapping and parameters, but it was not very helpful; I want to do it this way.
It's not clear what you are asking, actually. The Source Analyzer uses the source files (.csv) at import time, and the Source Qualifier therefore carries the same format.
So, if any values get added to your existing files (source.csv, target.csv), they effectively become new files for your existing mapping. Hence, you don't need to change the whole mapping; you just need to import the source again.

Combine csv files with different structures over time

I am here to ask you a hypothetical question.
Part of my current job consists of creating and updating dashboards. Most dashboards have to be updated every day.
I've created a Power BI dashboard from data linked to a folder filled with CSV files. I did some queries to edit some things. So, every day, I download a CSV file from a client's web application and add it to the linked folder; everything gets updated automatically and all the queries created are applied.
Hypothetical scenario: my client changes the CSV structure (e.g. the column order, a few column names). How can I deal with this so I can keep my merged CSV-files table updated?
My guess would be to put the files with the new structure in a different folder, apply new queries so the table structures match, then append queries so I have a single table of data.
Is there a better way?
Thanks in advance.
Say I have some CSVs (all in the same folder) that I need to append/combine into a single Excel table, but:
the column order varies in some CSVs,
and the headers in some CSVs are different (for whatever reason) and need changing/renaming.
First CSV:
a,c,e,d,b
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
Second CSV:
ALPHA,b,c,d,e
4,4,4,4,4
5,5,5,5,5
6,6,6,6,6
Third CSV:
a,b,charlie,d,e
7,7,7,7,7
8,8,8,8,8
9,9,9,9,9
10,10,10,10,10
If the parent folder (containing my CSVs) is at "C:\Users\user\Desktop\changing csvs\csvs", then this M code should help me achieve what I need:
let
    renameMap = [ALPHA = "a", charlie = "c"],
    filesInFolder = Folder.Files("C:\Users\user\Desktop\changing csvs\csvs"),
    binaryToCSV = Table.AddColumn(filesInFolder, "CSVs", each
        let
            csv = Csv.Document([Content], [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
            promoteHeaders = Table.PromoteHeaders(csv, [PromoteAllScalars = true]),
            headers = Table.ColumnNames(promoteHeaders),
            newHeaders = List.Transform(headers, each Record.FieldOrDefault(renameMap, _, _)),
            renameHeaders = Table.RenameColumns(promoteHeaders, List.Zip({headers, newHeaders}))
        in
            renameHeaders
    ),
    append = Table.Combine(binaryToCSV[CSVs])
in
    append
You'd need to change the folder path in the code to whatever it is on your system.
Regarding the line renameMap = [ALPHA = "a", charlie = "c"], I needed to change "ALPHA" to "a" and "charlie" to "c" in my case, but you'd need to replace these with whatever columns need renaming in your case. (Add however many headers you need to rename.)
The line append = Table.Combine(binaryToCSV[CSVs]) appends the tables to one another (to give you one table). It should automatically handle differences in column order. If there are any rogue columns (e.g. say there was a column f in one of my CSVs that I didn't notice), then the final table will contain a column f, albeit with some nulls/blanks -- which is why it's important that all renaming has been done before that line.
Once combined, you can obviously do whatever else needs doing to the table.
Try it to see if it works in your case.

Saving only data sets that are not empty

I have a large dataset where I do data validation using a syntax. For each validation a variable is created and set to 1 if there is a problem with data I need to check out.
For each validation I then create a subset of the data holding only the relevant variables for the relevant cases. Still using syntax, I save these data files as Excel files in order to do the checks and correct the data (in a database).
Problem is that not all of my 50+ validations detect any problematic data every time I run the check, but 50+ files are saved because I save a file for each validation. I'd like to save the files only if there is data in them.
Current syntax for saving the files is:
DATASET ACTIVATE DataSet1.
DATASET COPY error1.
DATASET ACTIVATE error1.
FILTER OFF.
USE ALL.
SELECT IF (var_error1 = 1).
EXECUTE.
SAVE TRANSLATE OUTFILE='path' + '_error1.xlsx'
/TYPE=XLS
/VERSION=12
/MAP
/REPLACE
/FIELDNAMES
/CELLS=VALUES
/KEEP=var1 var2 var3 var4.
This is repeated for each validation. If no case violates the validation for "error1" I will still get an output file (which is empty).
Any way to alter the syntax to only save the data if there are in fact cases that violate the validation?
The following syntax will write a new syntax file that contains the command to save the file to Excel - only if there are actual cases in the file. You will run the new syntax every time, but the Excel file will be created only in the relevant cases:
DATASET ACTIVATE DataSet1.
DATASET COPY error1.
DATASET ACTIVATE error1.
FILTER OFF.
USE ALL.
SELECT IF (var_error1 = 1).
EXECUTE.
do if $casenum=1.
write outfile='path\tmp\run error1.sps' /"SAVE TRANSLATE OUTFILE='path\var_error1.xlsx'"
/" /TYPE=XLS /VERSION=12 /MAP /REPLACE /FIELDNAMES /CELLS=VALUES /KEEP=var1 var2 var3 var4.".
end if.
exe.
insert file='path\tmp\run error1.sps'.
Please edit the "path" according to your needs.
Note that the new syntax file will be written in all cases, but when there is no data in the dataset, the syntax will be empty, and so no empty Excel file will be created.
