Combining multiple sheets with different columns using Power Query - powerquery

I have a workbook with multiple sheets that need to be combined, i.e. stacked, into one table. While they share many column names, they do not all have the same columns, and the column order differs. Because of this I cannot use the built-in merge functionality, since it relies on column order. Table.Combine will solve the problem, but I cannot figure out how to write a statement that uses the "each" mechanic to do that.
For each worksheet in x workbook
Table.Combine(prior sheet, next sheet)
return all sheets stacked.
Would someone please help?

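If you load your workbook with Excel.Workbook, you can filter on the Sheet kind (instead of the Table or DefinedName kinds) and ignore the sheet names. Table.Combine then stacks the sheets by column name, so differing column order does not matter.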
let
    Source = Excel.Workbook(File.Contents("C:\Path\To\File\FileName.xlsx"), null, true),
    #"Filter Sheets" = Table.SelectRows(Source, each [Kind] = "Sheet"),
    #"Promote Headers" = Table.TransformColumns(#"Filter Sheets", {{"Data", each Table.PromoteHeaders(_, [PromoteAllScalars=true])}}),
    #"Combine Sheets" = Table.Combine(#"Promote Headers"[Data])
in
    #"Combine Sheets"
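To see why column order does not matter here, a minimal sketch with two made-up inline tables may help. The names SheetA and SheetB and their contents are hypothetical stand-ins for two worksheets; only Table.Combine itself comes from the answer above.

```powerquery
let
    // Hypothetical stand-ins for two worksheets with different columns and column order
    SheetA = #table({"Name", "Qty"}, {{"x", 1}}),
    SheetB = #table({"Qty", "Color", "Name"}, {{2, "red", "y"}}),

    // Table.Combine matches columns by name, not by position;
    // a column missing from one table is filled with null for that table's rows
    Combined = Table.Combine({SheetA, SheetB})
in
    Combined
    // Expected shape: columns Name, Qty, Color
    // with rows {"x", 1, null} and {"y", 2, "red"}
```

Pasting this into a blank query is a quick way to check the behavior before pointing the real query at a workbook.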

Load each table into Power Query as a separate query.
Fix up the column names as needed in each individual query.
Save each query as a connection.
In one of the queries (or in a separate query), use the Append command to append all the fixed-up queries, which now have the same column names.

Related

How to Combine Excel files that contain valid tables and Remap column Names - Power Query Function

I’m making a Power Query (M Code) that combines all Sheets in Workbooks stored in a folder. The logic is the following
Read folder content
Form a list of Workbooks(Sheets) within
Invoke a Function to “Format” content of each Sheet to append the records in a consolidated table
The invoked Function on every Sheet should:
Identify where the “Titles” Row is located
Remove “n” records until the “Titles” Row
Remap “Titles” to a standard Name from the HeaderMap table
Reorder the columns according to HeaderMap table
Promote the “Titles” Row to Columns Headers
Change column types according to HeaderMap table
Remove “Blank” records
The caveat is that I may encounter Sheets that have no useful information. I need to peek inside each Sheet to verify that it has valid content before executing the function that formats it. How can I skip such a Sheet when consolidating all tables? Something like:
Identify indexed row where the “Titles” are located
If (no valid “Titles” found) then Skip Sheet
else (continue with remaining steps)
Thank you in advance
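One possible way to express the "skip the Sheet" branch is to have the formatting function return null when no title row is found, and to drop the nulls before combining. The sketch below is only an illustration under stated assumptions: TitleMarker, FormatSheet, and the two inline sample sheets are hypothetical, and the remap, reorder, and type steps driven by the HeaderMap table are left out.

```powerquery
let
    // Hypothetical marker: a sheet counts as valid only if some row contains this title text
    TitleMarker = "Invoice No",

    // Returns the formatted sheet, or null when no title row is found
    FormatSheet = (sheet as table) as nullable table =>
        let
            // For each row, true if the row contains the marker; then find the first true
            TitleIndex = List.PositionOf(
                List.Transform(Table.ToRows(sheet), each List.Contains(_, TitleMarker)),
                true),
            Result =
                if TitleIndex < 0 then
                    null  // no valid titles: signal "skip this sheet"
                else
                    // Drop the rows above the title row, then promote it to headers
                    Table.PromoteHeaders(Table.Skip(sheet, TitleIndex), [PromoteAllScalars = true])
        in
            Result,

    // Made-up sample sheets: one with titles on its second row, one without any titles
    Good = #table({"Column1", "Column2"}, {{"Report", null}, {"Invoice No", "Amount"}, {"A-1", 10}}),
    Junk = #table({"Column1", "Column2"}, {{"notes", null}}),

    Formatted = List.Transform({Good, Junk}, FormatSheet),

    // Skipped sheets are null, so remove them before stacking
    Combined = Table.Combine(List.RemoveNulls(Formatted))
in
    Combined
```

Returning null instead of raising an error keeps the skip decision inside the function, so the consolidating query only needs List.RemoveNulls.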

Datameer - Add Columns to Joined table

I have joined some data from HDFS with some data from an Oracle DW, which is working fine, but I can't seem to add any new columns to this sheet. To add columns for calculated fields etc., I have to duplicate the sheet and do it there, which doesn't seem very efficient.
Am I doing something wrong here or can you not add columns to a join result sheet?
... but I can't seem to add any new columns to this sheet.
Right. It will not be possible to add columns to a JoinedSheet. It is a new data set containing columns from two or more sheets based on a key column which you defined.
... or can you not add columns to a join result sheet?
It will be necessary to reference these data as input for a new Worksheet by Duplicating Worksheet.
Another approach could be to use the Datameer REST API. You can get the content of the workbook in JSON format, add columns manually or with a simple script, and then update the workbook with the changed JSON file.

Is there a way to dynamically identify and expand an embedded table's columns?

If I want to expand this embedded table...
...and I click on the expand button, I'm presented with the dropdown to select which columns I want to expand:
However, if I choose '(Select All Columns)' to expand them all, Power Query turns that into hard-coded column names of all the columns at the time I do that. Like this:
= Table.ExpandTableColumn(Source, "AllData", {"Column1", "Column2", "Column3", "Column4", "Custom"}, {"Column1", "Column2", "Column3", "Column4", "Custom"})
After that, if the underlying embedded table's columns change, the hard-coded column names will no longer be relevant and the query will "break."
So how can I tell it to dynamically identify and extract all of the current columns of the embedded table?
You can do something like this to get the list of column names:
List.Accumulate(Source[AllData], {}, (state, current) => List.Union({state, Table.ColumnNames(current)}))
This goes through each cell in the column, gets the column names from the table in that cell, and adds the new names to the result. It's easier to store this in a new step and then reference that in your next step.
Keep in mind that this method can be much slower than passing in the list of names you know about because it has to scan through the entire table to get the column names. You may also have problems if you use this for the third parameter in Table.ExpandTableColumn because it could use a column name that already exists.
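Put together, the two steps might look like the sketch below. It uses List.Union over List.Transform, which should collect the same names as the List.Accumulate version, and it prefixes the new names with "AllData." to sidestep the name clash mentioned above. The inline Source table and the prefix are made-up assumptions for illustration only.

```powerquery
let
    // Made-up stand-in for the question's Source: an outer table with nested tables in "AllData"
    Source = #table(
        {"Index", "AllData"},
        {
            {1, #table({"Column1", "Column2"}, {{"a", "b"}})},
            {2, #table({"Column2", "Custom"}, {{"c", "d"}})}
        }),

    // Collect the distinct column names found across all nested tables
    ColumnNames = List.Union(List.Transform(Source[AllData], Table.ColumnNames)),

    // Prefix the output names so they cannot collide with existing outer columns
    NewNames = List.Transform(ColumnNames, each "AllData." & _),

    Expanded = Table.ExpandTableColumn(Source, "AllData", ColumnNames, NewNames)
in
    Expanded
```

Keeping ColumnNames as its own step makes it easy to inspect which names were discovered when the embedded tables change.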
Try using Table.Join, which joins and expands the second table in one step.
#"Merged Queries" = Table.Join(Source,{"Index.1"},Table2,{"Index.2"},JoinKind.LeftOuter)
You just need to make sure that the column names are unique across the two tables.
Use Table.PrefixColumns to ensure that the column names are unique.

Removing a dynamic list of columns in powerquery

I'm working on a tool to help my team identify changes in some data files. Long story short, I managed to put something together (I'm quite the beginner with Power Query and M) that works well, but it lacks user friendliness.
The issue is that not all team members need the tool to check for differences in all columns (different people, different interests). To manage this, I used the following to remove all the unneeded columns before doing the compare:
= Table.RemoveColumns(myTable,{"col1","col2","col3"... etc
This works but if you want to change the configuration you need to go into the code and modify the list.
My question is the following: Is there any way to integrate a dynamic list into this code? i.e. have that list of columns in an easy to use table, "tick/untick" the ones you want and have the code remove the rest?
If your intent is to allow the user to select columns without entering the query editor, then you may benefit from using a parameter table as described here: http://www.excelguru.ca/blog/2014/11/26/building-a-parameter-table-for-power-query/ . You should be able to expose a two-column, N-row table to the user with some predefined column names/numbers. You can use data validation to constrain user inputs to a binary on/off behavior ( https://support.office.com/en-us/article/Apply-data-validation-to-cells-29fecbcc-d1b9-42c1-9d76-eff3ce5f7249 ).
( P.S. Based on your description of your goals, the Inquire add-in may already offer the functionality you are looking for. )
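A minimal sketch of how such a parameter table could drive the column removal follows. The table name ColumnConfig, its column names "Column" and "Include", and the "Y"/"N" values are all hypothetical; an inline table stands in for the worksheet table so the sketch runs on its own.

```powerquery
let
    // Hypothetical parameter table. In a workbook this could instead be read with
    // Excel.CurrentWorkbook(){[Name = "ColumnConfig"]}[Content]
    Config = #table({"Column", "Include"}, {{"col1", "Y"}, {"col2", "N"}, {"col3", "Y"}}),

    // Made-up stand-in for the data table from the question
    myTable = #table({"col1", "col2", "col3"}, {{1, 2, 3}}),

    // Names of the columns the user has not ticked
    ColumnsToRemove = Table.SelectRows(Config, each [Include] <> "Y")[Column],

    // MissingField.Ignore avoids an error if a listed column is absent from the data
    Result = Table.RemoveColumns(myTable, ColumnsToRemove, MissingField.Ignore)
in
    Result
```

With this shape, changing the configuration means editing cells in the worksheet and refreshing, not editing the query.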
Probably the easiest way is to use "Choose Columns" on the Home tab in the Query Editor and then rename the generated step like:
#"CHOOSE COLUMNS HERE ----->" = Table.SelectColumns(Source,{"Column1", "Column2", "Column3", "Column5", "Column7", "Column8", "Column9", "Column10"})
Then when you want to adjust the selected columns, you can press the small wheel to which the arrow is pointing, and a popup will show up from which you can do your (un)ticking.
Alternatively, if you use multiple queries with the same selection, you can create an additional query that outputs a list, like:
let
    Source = Table.FromList(List.Transform({1..10}, each "Column" & Text.From(_)),null,{"Available Columns"}),
    Transposed = Table.Transpose(Source),
    #"CHOOSE COLUMNS HERE ----->" = Table.SelectColumns(Transposed,{"Column2", "Column3", "Column5", "Column6", "Column8", "Column9", "Column10"}),
    TransposedBack = Table.Transpose(#"CHOOSE COLUMNS HERE ----->"),
    ConvertedToList = TransposedBack[Column1]
in
    ConvertedToList
And then use that list in your queries, like:
= Table.SelectColumns(#"Transposed Table",SelectedColumns)
where SelectedColumns is the name of the query with the selected columns.

HappyBase - Is there an equivalent of find_one or scan_one?

All the rows in a particular HBase table that I am making a UI for happen to have the same columns and will have so for the foreseeable future. I would like my html data visualizer application to simply query for a single random row to take note of the column names, and put this list of column names into a variable to refer to throughout the program.
I didn't see any equivalent to find_one or scan_one in the docs for HappyBase.
What is the best way to accomplish this?
This will fetch only the first row:
row = next(table.scan(limit=1))
Additionally, you can specify a filter string to avoid retrieving the values, which is only worthwhile if your values are large and you're performing this query often.
I use the 'limit' option.
Here is my HappyBase Python code:
import happybase

pool1 = happybase.ConnectionPool(size=2, host='172.00.00.01')
with pool1.connection() as conn1:
    hTable = conn1.table('HBastTableHere')
    # limit=1 stops the scan after the first row
    for rowKey, rowData in hTable.scan(limit=1):
        print(rowData)
You can create a Scanner object without specifying the start row (so that it starts at the first row in the table) and limit the scan to one row. You will then get only the first row.
In the HBase shell it should look like this:
scan 'table_name', {LIMIT => 1}
I don't know HappyBase, but I think you should be able to do the same there.
