How to convert key/value pairs into columns? - powerquery

AMENDMENT:
I apologize: in trying to simplify my question I failed to mention that there are a bunch of static columns. In fact, the majority of the XML consists of specific tags, and the (dynamic) Key/Value pairs are just a way for developers to extend the schema. Unfortunately, this means that I cannot pivot the entire table!
To demonstrate this, consider the following input:
<?xml version="1.0"?>
<list>
<value type="object">
<customer>186748</customer>
<create_date>2013-01-22</create_date>
<tag0 name="Last_Name" type="object">
<value>Smith</value>
</tag0>
<tag1 name="First_Name" type="object">
<value>Wendy</value>
</tag1>
<tag2 name="Company" type="object">
<value>ACME Inc</value>
</tag2>
</value>
<value type="object">
<customer>256238</customer>
<create_date>2013-01-22</create_date>
<tag0 name="First_Name" type="object">
<value>Bob</value>
</tag0>
<tag1 name="Company" type="object">
<value>ABC Corp</value>
</tag1>
</value>
<value type="object">
<customer>301654</customer>
<create_date>2013-01-22</create_date>
<tag0 name="Company" type="object">
<value>Everything Co</value>
</tag0>
</value>
</list>
Should produce the following output:
I am looking for an approach to solving this type of problem. I am now wondering if I can just pivot part of a table using a custom function.
ORIGINAL POST:
I am looking for a suggested approach in PowerQuery to create columns from dynamic Key/Value pairs. My data source is an XML document with a bunch of generic elements (e.g. tag0, tag1, ..., tagN), each of which holds a Key (in the "name" attribute) and a Value (in the "value" element), for example:
<?xml version="1.0"?>
<list>
<value type="object">
<tag0 name="Last_Name" type="object">
<value>Smith</value>
</tag0>
<tag1 name="First_Name" type="object">
<value>Wendy</value>
</tag1>
<tag2 name="Company" type="object">
<value>ACME Inc</value>
</tag2>
</value>
<value type="object">
<tag0 name="First_Name" type="object">
<value>Bob</value>
</tag0>
<tag1 name="Company" type="object">
<value>ABC Corp</value>
</tag1>
</value>
<value type="object">
<tag0 name="Company" type="object">
<value>Everything Co</value>
</tag0>
</value>
</list>
What I want to do is use the Key to create a table column, and assign it the Value (or null, if it doesn't exist). Ideally, the output would look like:
When I load this into PowerQuery, it looks like the following:
My first thought was to expand each of the generic table columns (e.g. tag0, tag1, etc.) to hold the Key and Value, then separate the columns by Key, then merge all of the Keys across the expanded generic tags. By this I mean: start by expanding "tag0" and create separate conditional columns for each Key type (e.g. "Company0", "First_Name0", "Last_Name0"); do the same for "tag1", "tag2" and "tag3"; then merge "Company0" (from tag0), "Company1" (from tag1), ..., "Company3" (from tag3), and do the same for "First_Name" and "Last_Name". The trouble with this approach is that it is a whole lot of effort and does not scale well if another Key is added. With 4 generic tags and (currently) 5 possible Keys, it means 2 columns for each generic tag for the expansion, plus 20 conditional columns, plus 5 columns for the merge.
The Question: Is there a better approach for converting one to many Key/Value pairs into columns?

Edited to account for amended sample
The below seems to work with your sample data.
Read the code comments and explore the Applied Steps to understand the algorithm.
The approach is to expand and then extract the Values into one column and the corresponding attributes into another column.
By then Pivoting (with no aggregation) on the Attributes column, PQ will create a new column for each attribute.
No custom function required in this version
Main M Code
let
//change next line to reflect actual data source
Source = Xml.Document(File.Contents("C:\Users\ron\Desktop\Sample XML.xml")),
//Expand as needed
#"Removed Columns" = Table.RemoveColumns(Source,{"Name", "Namespace", "Attributes"}),
#"Expanded Value" = Table.ExpandTableColumn(#"Removed Columns", "Value", {"Value"}, {"Value.1"}),
#"Expanded Value.1" = Table.ExpandTableColumn(#"Expanded Value", "Value.1", {"Name", "Value", "Attributes"}, {"Name", "Value", "Attributes"}),
//Extract Attributes and Values to columns
//Extraction method depends on level of object
#"Added Custom" = Table.AddColumn(#"Expanded Value.1", "Name.1", each if Value.Is([Value],type table) then [Value]{0}[Value] else [Value], type text),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Attribute", each if Value.Is([Value],type table) then [Attributes]{0}[Value] else [Name]),
//remove original columns
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"Name", "Value", "Attributes"}),
//create "Grouper Column" to group each set of data
//Assumes first entry of each "group" is "customer"
#"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1, Int64.Type),
#"Added Custom2" = Table.AddColumn(#"Added Index", "Grouper", each if [Attribute]="customer" then [Index] else null),
#"Removed Columns2" = Table.RemoveColumns(#"Added Custom2",{"Index"}),
#"Filled Down" = Table.FillDown(#"Removed Columns2",{"Grouper"}),
//Group by Grouper, then Pivot each sub table
#"Grouped Rows" = Table.Group(#"Filled Down", {"Grouper"}, {
{"Pivot", each Table.Pivot(Table.RemoveColumns(_, "Grouper"), List.Distinct([Attribute]), "Attribute", "Name.1")}}),
#"Removed Columns3" = Table.RemoveColumns(#"Grouped Rows",{"Grouper"}),
//Expand the pivoted tables and set the correct column order and data types
#"Expanded Pivot" = Table.ExpandTableColumn(#"Removed Columns3", "Pivot", {"customer", "create_date", "Last_Name", "First_Name", "Company"}),
#"Reorder Columns" = Table.ReorderColumns(#"Expanded Pivot", {"customer", "create_date", "First_Name", "Last_Name", "Company"}),
#"Set Data Types" = Table.TransformColumnTypes(#"Reorder Columns",
List.Zip({Table.ColumnNames(#"Reorder Columns"),{type text,type date,type text,type text,type text}}))
in
#"Set Data Types"
Results from amended question
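As a minimal sketch of why the static columns do not get in the way of the pivot: Table.Pivot treats every column other than the attribute and value columns as row keys. The table below is a hypothetical, already-flattened version of the sample data (the column names Attribute and Value are assumptions, not the names produced by the query above).

```powerquery
let
    // Hypothetical long-format table: static columns plus one Key/Value pair per row
    Source = #table(
        {"customer", "create_date", "Attribute", "Value"},
        {
            {"186748", "2013-01-22", "Last_Name", "Smith"},
            {"186748", "2013-01-22", "First_Name", "Wendy"},
            {"186748", "2013-01-22", "Company", "ACME Inc"},
            {"256238", "2013-01-22", "First_Name", "Bob"},
            {"256238", "2013-01-22", "Company", "ABC Corp"},
            {"301654", "2013-01-22", "Company", "Everything Co"}
        }
    ),
    // Pivot with no aggregation function: one new column per distinct Key.
    // customer and create_date are left alone and identify each output row.
    Pivoted = Table.Pivot(Source, List.Distinct(Source[Attribute]), "Attribute", "Value")
in
    Pivoted
```

This should give one row per customer, with null wherever a customer has no value for a given Key.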

What the heck. Editing answer for revised question
let Source = Xml.Tables(File.Contents("C:\temp\a.xml")),
Table0 = Source{0}[Table],
Expanded=ExpandAll(Table0),
shrunk= Table.RemoveColumns(Expanded,List.Select(Table.ColumnNames(Expanded),each Text.Contains(_,"type"))),
#"Added Index" = Table.AddIndexColumn(shrunk, "Index", 0, 1, Int64.Type),
Base=({"Index","customer","create_date"}), //names of columns to keep
repeating_groups=2, // repeating groups of xx columns
Combo = List.Transform(List.Split(List.Difference(Table.ColumnNames(#"Added Index"),Base),repeating_groups), each Base & _),
#"Added Custom" =List.Accumulate(Combo,#table({"Column1"}, {}),(state,current)=> state & Table.Skip(Table.DemoteHeaders(Table.SelectColumns(#"Added Index", current)),1)),
#"Rename"=Table.RenameColumns(#"Added Custom",List.Zip({List.FirstN(Table.ColumnNames(#"Added Custom"),List.Count(Base)),Base})) ,
#"Filtered Rows" = Table.SelectRows(#"Rename", each Record.Field(_,List.Last(Table.ColumnNames(#"Rename"))) <> null),
#"Pivoted Column" = Table.Pivot(#"Filtered Rows", List.Distinct(#"Filtered Rows"[Column5]), "Column5", "Column4")
in #"Pivoted Column"
// fn = ExpandAll
(TableToExpand as table, optional ColumnNumber as number) =>
//https://blog.crossjoin.co.uk/2014/05/21/expanding-all-columns-in-a-table-in-power-query/
let ActualColumnNumber = if (ColumnNumber=null) then 0 else ColumnNumber,
ColumnName = Table.ColumnNames(TableToExpand){ActualColumnNumber},
ColumnContents = Table.Column(TableToExpand, ColumnName),
ColumnsToExpand = List.Distinct(List.Combine(List.Transform(ColumnContents, each if _ is table then Table.ColumnNames(_) else {}))),
NewColumnNames = List.Transform(ColumnsToExpand, each ColumnName & "." & _),
CanExpandCurrentColumn = List.Count(ColumnsToExpand)>0,
ExpandedTable = if CanExpandCurrentColumn then Table.ExpandTableColumn(TableToExpand, ColumnName, ColumnsToExpand, NewColumnNames) else TableToExpand,
NextColumnNumber = if CanExpandCurrentColumn then ActualColumnNumber else ActualColumnNumber+1,
OutputTable = if NextColumnNumber>(Table.ColumnCount(ExpandedTable)-1) then ExpandedTable else ExpandAll(ExpandedTable, NextColumnNumber)
in OutputTable

Related

Power query grouping

Can Power query do this?
So I have a group of Parent IDs. If the Parent IDs are the same but the values of the corresponding attributes are different, I want PQ to tell me that they can be grouped together.
Here is the example.
Parent ID 12345 repeats and its values differ, so I want the output to say Yes in the SDSKU column. Parent ID 333 repeats but its values are the same, so that is not a grouping and I want it to say No.
If by "values" you mean the values of the column "Color", try the M code below:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Parent ID", Int64.Type}, {"Kitchen Sink", Int64.Type}, {"Color", type text}}),
#"Grouped Rows" = Table.Group(#"Changed Type", {"Parent ID", "Kitchen Sink"}, {{"AllData", each _, type table [Parent ID=nullable number, Kitchen Sink=nullable number, Color=nullable text]}, {"OccuID", each Table.RowCount(_), Int64.Type}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "NumberOfColors", each List.Count(List.Distinct([AllData][Color]))),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "SDSKU", each if [OccuID] = [NumberOfColors] then "Yes" else "No"),
#"Expanded AllData" = Table.ExpandTableColumn(#"Added Custom1", "AllData", {"Kitchen Sink", "Color"}, {"Kitchen Sink.1", "Color"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded AllData",{"OccuID", "NumberOfColors"})
in
#"Removed Columns"
If "attributes" are the values of every column except the one named Parent ID, try the M code below:
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Grouped Rows" = Table.Group(Source , {"Parent ID"}, {
{"data", each _, type table },
{"check", each if Table.RowCount(_) = Table.RowCount(Table.Distinct(_, List.Difference(Table.ColumnNames(_),{"Parent ID"}))) then "YES" else "NO"}}),
#"Expanded data" = Table.ExpandTableColumn(#"Grouped Rows", "data", List.Difference(Table.ColumnNames(Source),{"Parent ID"}), List.Difference(Table.ColumnNames(Source),{"Parent ID"}))
in #"Expanded data"
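The core test in both versions is the same: within each Parent ID, compare the row count with the number of distinct values. A minimal sketch on made-up data (the rows below are hypothetical; only the column names Parent ID, Color and SDSKU come from the answers above):

```powerquery
let
    // Hypothetical sample: 12345 has two different colors, 333 has the same color twice
    Source = #table(
        {"Parent ID", "Color"},
        {
            {12345, "Red"},
            {12345, "Blue"},
            {333, "Green"},
            {333, "Green"}
        }
    ),
    // "Yes" when every row in the group has a different Color, otherwise "No"
    Flagged = Table.Group(
        Source,
        {"Parent ID"},
        {{"SDSKU", each if List.Count(List.Distinct([Color])) = Table.RowCount(_) then "Yes" else "No", type text}}
    )
in
    Flagged
```

On this sample the result should be Yes for 12345 and No for 333. Note that a Parent ID with a single row also comes out as Yes under this rule, which may or may not be what you want.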

Powerquery: Remove next n rows after occurence of value in column

I frequently have large datasets in Power Query where, whenever a certain value (in this case "Page") occurs in a column, I need to remove that row as well as the following 13. This happens multiple times throughout the column.
I've tried referring to the next/previous rows by adding an index column and {[Index]+1} shenanigans but that either didn't work or took 15+ minutes to load.
I've tried setting up something with Table.RemoveFirstN(Text.Contains([Column], "Page"), 13) but that just errored out.
Would anyone know how I could filter the row where a value occurs, as well as the next n rows (index?) in Powerquery?
Kind regards,
This seems to work ok
We add an index and test for "Page". In a new column, if "Page" is present, copy over the index. Fill down, then group on that column. Add a second index within each group. Expand all columns. Filter out any row whose second index is 14 or less (the "Page" row plus the 13 rows after it). Remove the extra columns.
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Merged Price Country", type text}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1),
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each try if Text.Contains([Merged Price Country],"Page") then [Index] else null otherwise null),
#"Filled Down" = Table.FillDown(#"Added Custom",{"Custom"}),
mGroup = Table.Group(#"Filled Down", {"Custom"}, {{"Data", each Table.AddIndexColumn(_, "Index2", 1, 1), type table}}),
#"Removed Columns" = Table.RemoveColumns(mGroup,{"Custom"}),
// expand all columns
List = List.Union(List.Transform(#"Removed Columns"[Data], each Table.ColumnNames(_))),
#"Expanded Data" = Table.ExpandTableColumn(#"Removed Columns", "Data", List,List),
#"Filtered Rows" = Table.SelectRows(#"Expanded Data", each [Custom]=null or [Index2] > 14),
#"Removed Columns1" = Table.RemoveColumns(#"Filtered Rows",{"Index", "Custom", "Index2"})
in #"Removed Columns1"
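I skipped Table.RemoveFirstN() on the groupings in the code above in case there are leading rows you want to keep, but you could use it instead of adding the second index and filtering, like below. Be aware that this version also drops the first rows of the leading group (any rows before the first "Page").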
let Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Merged Price Country", type text}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1),
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each try if Text.Contains([Merged Price Country],"Page") then [Index] else null otherwise null),
#"Filled Down" = Table.FillDown(#"Added Custom",{"Custom"}),
mGroup = Table.Group(#"Filled Down", {"Custom"}, {{"Data", each Table.RemoveFirstN(_, 14), type table}}), // 14 = the "Page" row plus the 13 rows after it
#"Removed Columns" = Table.RemoveColumns(mGroup,{"Custom"}),
// expand all columns
List = List.Union(List.Transform(#"Removed Columns"[Data], each Table.ColumnNames(_))),
#"Expanded Data" = Table.ExpandTableColumn(#"Removed Columns", "Data", List,List),
#"Removed Columns1" = Table.RemoveColumns(#"Expanded Data",{"Index", "Custom"})
in #"Removed Columns1"
Different approach. Wonder which might be faster:
Create a list of rows to be removed (by row number)
Select the rows not in that list
let
Source = Excel.CurrentWorkbook(){[Name="Table12"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text", type text}, {"Data", Int64.Type}}),
//Add index column
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1, Int64.Type),
//create list rows to be removed
textCol = List.Transform(#"Added Index"[Text], each
if _ = null then null
else if Text.Contains(_,"Page",Comparer.OrdinalIgnoreCase) then "RemoveMe"
else _),
//create list of positions to be removed
removePos = List.Combine(List.Transform(List.PositionOf(textCol,"RemoveMe",Occurrence.All), each {_..List.Min({_+13, List.Count(textCol)})})),
//Filter the table using the "RemoveMe" list
filter = Table.SelectRows(#"Added Index", each not List.Contains(removePos,[Index])),
#"Removed Columns" = Table.RemoveColumns(filter,{"Index"})
in
#"Removed Columns"
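For experimenting with the position-list idea, here is a small self-contained sketch. The sample rows and the name n are made up for illustration, and n is set to 2 instead of 13 to keep the table short. Wrapping the list in List.Buffer is a guess at what might help performance on large tables, not something measured.

```powerquery
let
    n = 2, // hypothetical: number of rows to drop after each marker row
    Source = #table(
        {"Text"},
        {{"a"}, {"Page 1"}, {"x"}, {"y"}, {"b"}, {"Page 2"}, {"x"}, {"y"}, {"c"}}
    ),
    Indexed = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
    // Zero-based positions of the rows that contain the marker
    MarkerPositions = Table.SelectRows(Indexed, each Text.Contains([Text], "Page"))[Index],
    // Each marker position plus the n positions after it, buffered once in memory
    RemovePositions = List.Buffer(List.Combine(List.Transform(MarkerPositions, each {_ .. _ + n}))),
    Kept = Table.SelectRows(Indexed, each not List.Contains(RemovePositions, [Index])),
    Result = Table.RemoveColumns(Kept, {"Index"})
in
    Result
```

On this sample only the rows a, b and c should remain. The sketch assumes the Text column has no nulls; Text.Contains would raise an error on a null value.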

How to select certain column in power query

I would like to select certain columns in Power Query without using their names. For example, in R I can do this with the select command. I'm wondering how I can do it in Power Query. I found some information here, but not all that I need.
Any idea how to do this if I want to refer to more than one column?
It doesn't work if I write the code as below:
#"Filtered Part Desc" = Table.SelectRows (
#"Removed Columns3",
each List.Contains(
{ "ENG", "TRANS" },
Record.Field(_, Table.ColumnNames(#"Removed Columns3") { 5, 6, 7 })
)
)
Let's say I have this table and want to do a couple of things to it.
First, I want to change the column type of the second and last columns. We can use Table.ColumnNames to do this using simple indexing (which starts at zero) as follows:
Table.TransformColumnTypes(
Source,
{
{Table.ColumnNames(Source){1}, Int64.Type},
{Table.ColumnNames(Source){3}, Int64.Type}
}
)
That works but requires specifying each index separately. If we want to unpivot these columns like this
Table.Unpivot(#"Changed Type", {"Col2", "Col4"}, "Attribute", "Value")
but using the index values instead we can use the same method as above
Table.Unpivot(
#"Changed Type",
{
Table.ColumnNames(Source){1},
Table.ColumnNames(Source){3}
}, "Attribute", "Value"
)
But is there a way to do this where we can use a single list of positional index values and use Table.ColumnNames only once? I found a relatively simple though unintuitive method on this blog. For this case, it works as follows:
Table.Unpivot(
#"Changed Type",
List.Transform({1,3}, each Table.ColumnNames(Source){_}),
"Attribute", "Value"
)
This method starts with the list of positional index values and then transforms them into column names by looking up the names of the columns corresponding to those positions.
Here's the full M code for the query I was playing with:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WSlTSUTIE4nIgtlSK1YlWSgKyjIC4AogtwCLJQJYxEFcCsTlYJAXIMgHiKiA2U4qNBQA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Col1 = _t, Col2 = _t, Col3 = _t, Col4 = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{Table.ColumnNames(Source){1}, Int64.Type},{Table.ColumnNames(Source){3}, Int64.Type}}),
#"Unpivoted Columns" = Table.Unpivot(#"Changed Type", List.Transform({1,3}, each Table.ColumnNames(Source){_}), "Attribute", "Value")
in
#"Unpivoted Columns"
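The same lookup can be applied to the row filter from the question. The sketch below assumes the intent was to keep rows where any of the columns at the given positions equals "ENG" or "TRANS"; the sample table and the positions {1, 2} are hypothetical.

```powerquery
let
    // Hypothetical sample table
    Source = #table(
        {"Col1", "Col2", "Col3", "Col4"},
        {
            {"a", "ENG", "x", "y"},
            {"b", "x", "TRANS", "y"},
            {"c", "x", "y", "z"}
        }
    ),
    Names = Table.ColumnNames(Source),
    // Turn positions (zero-based) into column names once
    Wanted = List.Transform({1, 2}, each Names{_}),
    // Keep a row if any of the wanted fields holds one of the target values
    Filtered = Table.SelectRows(
        Source,
        each List.ContainsAny(Record.ToList(Record.SelectFields(_, Wanted)), {"ENG", "TRANS"})
    )
in
    Filtered
```

On this sample the rows starting with a and b should be kept. Record.Field accepts only a single field name, which is likely why passing three positions at once did not work in the original attempt.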

countif formula in power query

I had a formula in a table in Excel:
=IF([#STATUS]="",[KEY]&"_"&COUNTIF(INDEX([KEY],1):[#KEY],[#KEY]),"")
It showed how many times a value had appeared so far in the data, but the same formula does not work in Power Query.
I use this formula to get each value's running position in a long data list, and then I use that result in an INDEX/MATCH formula to locate other relevant data.
I am trying to achieve:
Date        Name                   Frequency
1/10/2019   Adrian Bartholomeusz   1
1/10/2019   Aditya Tipnis          1
2/10/2019   Abdul Atef             1
2/10/2019   Aditya Tipnis          2
3/10/2019   Abdul Atef             2
In Excel I used the formula "=COUNTIF(INDEX([Name],1):[#Name],[#Name])", but when I use the same in Power Query I get an error.
The key steps are:
Add Index
Group Rows
Transform Columns to add a sub-index.
Expand the data back.
The rest are cosmetics.
let
Source = Excel.CurrentWorkbook(),
Table1 = Source{[Name="Table1"]}[Content],
#"Added Index" = Table.AddIndexColumn(Table1, "Index", 0, 1),
#"Grouped Rows" = Table.Group(#"Added Index", {"key"}, {{"Data", each _, type table [key=number, f=text, Index=number]}}),
#"TransformColumns" = Table.TransformColumns(#"Grouped Rows",{"Data", (x) => Table.AddIndexColumn(x, "Index2", 1, 1)}),
#"Expanded Data" = Table.ExpandTableColumn(#"TransformColumns", "Data", {"excel formula", "Index", "Index2"}, {"excel formula", "Index", "Index2"}),
#"Added Custom" = Table.AddColumn(#"Expanded Data", "PQ method", each Text.From([key]) & "_" & Text.From([Index2])),
#"Sorted Rows" = Table.Sort(#"Added Custom",{{"Index", Order.Ascending}}),
#"Removed Columns" = Table.RemoveColumns(#"Sorted Rows",{"Index", "Index2"})
in
#"Removed Columns"
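Applied to the Date/Name example from the question, the same steps can be sketched as follows. The table is typed in by hand here, and the step names are arbitrary; sorting each group by the original index before numbering is a precaution in case grouping does not preserve row order.

```powerquery
let
    // The sample rows from the question, entered by hand
    Source = #table(
        {"Date", "Name"},
        {
            {"1/10/2019", "Adrian Bartholomeusz"},
            {"1/10/2019", "Aditya Tipnis"},
            {"2/10/2019", "Abdul Atef"},
            {"2/10/2019", "Aditya Tipnis"},
            {"3/10/2019", "Abdul Atef"}
        }
    ),
    // Remember the original row order
    Indexed = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
    // Within each Name, number the rows 1, 2, 3, ... in original order
    Grouped = Table.Group(
        Indexed,
        {"Name"},
        {{"Data", each Table.AddIndexColumn(Table.Sort(_, {{"Index", Order.Ascending}}), "Frequency", 1, 1, Int64.Type), type table}}
    ),
    // Stack the per-name tables back together and restore the original order
    Combined = Table.Combine(Grouped[Data]),
    Sorted = Table.Sort(Combined, {{"Index", Order.Ascending}}),
    Result = Table.RemoveColumns(Sorted, {"Index"})
in
    Result
```

The Frequency column should then read 1, 1, 1, 2, 2, matching the table in the question.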

Power Query - best way to sub select?

Suppose I have one column representing object type and another column representing object color. I want to remove blue and red fruits (fruit being an example of an object type) but keep all other red and blue objects.
How can I achieve this in Power Query?
Thanks,
Just select the rows that do not match:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
Filtered = Table.SelectRows(Source, each not ([ObjectType] = "Fruit" and ([ObjectColor]="Red" or [ObjectColor]="Blue")))
in
Filtered
Here's one way:
If you start with this:
You can merge the two columns together like this:
Then filter out the "Fruit,Blue" and "Fruit,Red":
Which yields this:
And you can then delete the "Merged" column to get this:
Here's the M code:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"ObjectType", type text}, {"ObjectColor", type text}}),
#"Inserted Merged Column" = Table.AddColumn(#"Changed Type", "Merged", each Text.Combine({[ObjectType], [ObjectColor]}, ","), type text),
#"Filtered Rows" = Table.SelectRows(#"Inserted Merged Column", each ([Merged] <> "Fruit,Blue" and [Merged] <> "Fruit,Red")),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Merged"})
in
#"Removed Columns"
