Power Query - strategy for handling repeating rows

Given a report which has a table with repeated row headings, is there a good strategy for using Power Query/M to extract the data in a clean format?
For example, the report available here has an Excel file (which at the time of writing points to August 2021):
https://www.opec.org/opec_web/static_files_project/media/downloads/publications/MOMR%20Appendix%20Tables%20(August%202021).xlsx
In this example:
- we have the World demand table portion
- and the Non-OPEC Liquids production portion
Both of these have the rows Americas/Europe/Asia Pacific, which makes it hard to distinguish them in Power Query.
What is the right approach to extract data from this type of table?

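I would add a custom column (Add Column > Custom Column) with the formula
=if [2018] = null then [Column] else null
and then right-click the new column and fill down.
The formula picks up the heading rows, which have no value in the [2018] column, and the fill down copies each heading onto the data rows beneath it. That puts World Demand and Non-OPEC in a column that you can additionally filter on.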
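The same idea can be sketched outside Power Query. The snippet below is only an illustration in Python, not M: the row values are made up, and the key names "Column" and "2018" simply mirror the formula above. Unlike the answer, it also drops the heading rows once their label has been carried down.

```python
# Rows as they might look after loading the sheet: heading rows have no
# value in the data column (None), data rows do. All numbers are invented.
rows = [
    {"Column": "World demand", "2018": None},
    {"Column": "Americas", "2018": 25.4},
    {"Column": "Europe", "2018": 14.3},
    {"Column": "Non-OPEC liquids production", "2018": None},
    {"Column": "Americas", "2018": 24.0},
    {"Column": "Europe", "2018": 3.8},
]


def tag_sections(rows, label_key="Column", value_key="2018"):
    """Attach the most recent heading to each data row (the fill-down step)."""
    section = None
    tagged = []
    for row in rows:
        if row[value_key] is None:
            # Heading row: remember its label, do not keep it as data.
            section = row[label_key]
        else:
            tagged.append({**row, "Section": section})
    return tagged


tagged = tag_sections(rows)
for row in tagged:
    print(row["Section"], "|", row["Column"], "|", row["2018"])
```

After this step the two "Americas" rows are distinguishable by their Section value, which is what the extra filterable column achieves in Power Query.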

Related

Compare Dynamic Lists Power BI

I have a table ("Issues") which I am creating in Power BI from a JIRA data connector, so it changes each time I refresh it. I am using three columns:
Form Name
Effort
Status
I created a second table and have summarized the Form Names and obtained the Total Effort:
SUMMARIZE(Issues,Issues[Form Name],"Total Effort",SUM(Issues[Effort (Days)]))
But I also want to add in a column for
Total Effort for each form name where the Status field is "Done"
My issue is that I don't know how to compare both tables / form names since these might change each time I refresh the table.
I need to write a conditional, something like
For each form name, print the total effort, and also print the total effort where the status is Done
I have tried SUMX, CALCULATE, SUM, FILTER but cannot get these to work - can someone help, please?
If all you need is to add a column to your summarized table that sums "Effort" only when the Status is set to 'Done' -- then this is the right place to use CALCULATE.
Table =
SUMMARIZE(
    Issues,
    Issues[Form Name],
    "Total Effort", SUM(Issues[Effort]),
    "Total Effort (Done)", CALCULATE(SUM(Issues[Effort]), Issues[Status] = "Done")
)
Here is a quick capture of what some of the mock data that I used to test this looks like. The Matrix is just the mock data with [Form Name] on the rows and [Status] on the columns. The last table shows the 'summarized' data calculated by the DAX above. You can compare this to the values in the matrix and see that they tie out.
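To make the grouping logic concrete, here is a rough Python equivalent of what the DAX computes. It is a sketch only: the sample rows are invented, and the dictionary keys just reuse the column names from the question. One difference to be aware of is that a form with no 'Done' rows gets 0 here, whereas the CALCULATE version would return a blank.

```python
from collections import defaultdict

# Invented sample data standing in for the Issues table.
issues = [
    {"Form Name": "Form A", "Effort": 3, "Status": "Done"},
    {"Form Name": "Form A", "Effort": 2, "Status": "In Progress"},
    {"Form Name": "Form B", "Effort": 5, "Status": "To Do"},
    {"Form Name": "Form B", "Effort": 1, "Status": "Done"},
    {"Form Name": "Form C", "Effort": 4, "Status": "To Do"},
]


def summarize(issues):
    """Per form name: total effort, and total effort where Status is 'Done'."""
    totals = defaultdict(lambda: {"Total Effort": 0, "Total Effort (Done)": 0})
    for issue in issues:
        row = totals[issue["Form Name"]]
        row["Total Effort"] += issue["Effort"]
        if issue["Status"] == "Done":
            row["Total Effort (Done)"] += issue["Effort"]
    return dict(totals)


summary = summarize(issues)
for form_name, row in summary.items():
    print(form_name, row)
```

Because the totals are rebuilt from whatever rows are present, nothing has to be compared between two tables when the set of form names changes on refresh.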

Data Studio Table Chart is not sorting correctly

I have a working chart of podcast episodes by download count in a query. That query is used to create a table chart in Data Studio. The file names are formatted as follows: 2020/889-Jan-16-2020-DMP.mp3
Episode 1000 no longer shows at the top of the sort order, because the file names are compared as text, so 1000 is treated as less than 999. See the table below:
2020/999-Jun-24-2020-DMP.mp3
2020/998-Jun-23-2020-DMP.mp3
2020/997-Jun-22-2020-DMP.mp3
2020/996-Jun-21-2020-DMP.mp3
2020/995-Jun-18-2020-DMP.mp3
2020/994-Jun-17-2020-DMP.mp3
2020/993-Jun-16-2020-DMP.mp3
continuing ...
2020/886-Jan-13-2019-DMP.mp3
2020/885-Jan-12-2019-DMP.mp3
2020/884-Jan-9-2019-DMP.mp3
2020/883-Jan-8-2019-DMP.mp3
2020/882-Jan-7-2019-DMP.mp3
2020/881-Jan-6-2019-DMP.mp3
2020/880-Jan-5-2019-DMP.mp3
2020/879-Jan-2-2019-DMP.mp3
2020/1001-Jun-30-2020-DMP.mp3 <-------Should be at the top of the table
2020/1000-Jun-29-2020-DMP.mp3 <-------Should be at the top of the table
2019/878-Dec-19-2019-DMP.mp3
2019/877-Dec-18-2019-DMP.mp3
2019/876-Dec-17-2019-DMP.mp3
Let me know if that makes sense...
1) REGEXP_EXTRACT
It can be achieved by adding the REGEXP_EXTRACT Calculated Field below as the Sort field and setting the order to Descending (The RegEx extracts the respective number component, for example 1001):
AVG(CAST(REGEXP_EXTRACT(Field, "^\\d+/(\\d+)-") AS NUMBER ) )
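The pattern can be checked outside Data Studio. The Python sketch below uses the same regular expression (with single backslashes, since Python raw strings do not need the doubled escaping) on a few of the file names from the question. The helper name episode_number is hypothetical and only exists for this illustration.

```python
import re

# Same pattern as the calculated field: leading year, slash, episode number, dash.
EPISODE = re.compile(r"^\d+/(\d+)-")


def episode_number(name):
    """Return the episode number as an int, or -1 if the name does not match."""
    match = EPISODE.match(name)
    return int(match.group(1)) if match else -1


files = [
    "2020/999-Jun-24-2020-DMP.mp3",
    "2020/998-Jun-23-2020-DMP.mp3",
    "2020/1001-Jun-30-2020-DMP.mp3",
    "2020/1000-Jun-29-2020-DMP.mp3",
    "2019/878-Dec-19-2019-DMP.mp3",
]

text_order = sorted(files, reverse=True)                          # plain text sort
numeric_order = sorted(files, key=episode_number, reverse=True)   # sort by number

print(text_order[0])
print(numeric_order[0])
```

Sorting on the extracted number puts episode 1001 first, whereas the plain text sort still puts 999 first.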
2) Troubleshooting Calculated Fields (Invalid Formula)
Adding a section on general troubleshooting for Calculated Fields based on an Earlier Post on the Google Data Studio Forum:
Field Editing: Have a look at whether Field Editing is Enabled (Although it shouldn't affect creating Data Source Calculated fields);
Refresh: Refresh the Data Source fields as well as the fields in the Report;
Page Reload: Shortcut - F5;
Hard Page Reload: Shortcut - Ctrl + F5;
Chart-level Calculated Field: Double check whether using a Chart-level Calculated Field instead of a Data Source-level Calculated Field, resolves the issue.
I found that this worked too: sort by the date field with its granularity set to Year Week.

Filtering date column in Visual Studio SSIS ( Derived column)

I want to filter a date column that spans 2014-2019 down to 2017-2018 in Visual Studio with SSIS.
I have tried different things but none seem to work.
The Derived Column in your example ("Derived Column date") is likely what you're looking for.
The Week column is of the date type DT_DBDATE. Your string "2017-01-01" should be getting promoted to a date data type, so the boolean check will identify whether the lower bound is met.
You'd either need to create a second derived column to check against the upper bound or, as @vhoang indicates, change the logic to just extract the year from the date column.
YEAR([Week]) >= 2017 && YEAR([Week]) < 2019
Now you have a column that flags whether each row meets the criteria (year is 2017 or 2018).
You will then need to do something with that flag. In SSIS, that something is a Conditional Split. I would add a new output path called OutOfConsideration whose condition is the inverse of the Derived Column above ("Derived Column date"), which is true when the year meets our criteria.
Now connect your destination, or additional processing steps, to the Conditional Split's default output path. If you need to do processing on the invalid data, that'd be the OutOfConsideration path.
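As a rough illustration of the flag-then-split flow, here is the same logic in Python rather than SSIS. The sample dates are invented, and the names in_scope, default_output and out_of_consideration are hypothetical labels that mirror the Derived Column, the default output path and the OutOfConsideration path.

```python
from datetime import date


def in_scope(week):
    """Mirror of the expression YEAR([Week]) >= 2017 && YEAR([Week]) < 2019."""
    return 2017 <= week.year < 2019


# Invented rows standing in for the Week column.
weeks = [
    date(2014, 6, 2),
    date(2017, 1, 1),
    date(2018, 12, 31),
    date(2019, 1, 7),
]

# Conditional Split: rows failing the check go to the named path,
# everything else falls through to the default output.
out_of_consideration = [w for w in weeks if not in_scope(w)]
default_output = [w for w in weeks if in_scope(w)]

print(default_output)
print(out_of_consideration)
```

Checking the year with a half-open range covers both bounds in a single expression, which is why a second derived column is not needed.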
Finally, to get the best performance out of SSIS, only bring in the rows that you need. If the source data is in a system that supports filtering, filter the data there. It is easy to click-click-click design SSIS packages, but it is better long term to write custom queries that bring only the required columns and rows into the data flow. Less work all around, lower maintenance cost, etc.

Sort Rows in Excel?

I have an issue with an Excel spreadsheet that I want to solve without VBA if I can, just because it seems easier to implement that way. There are many columns in the sheet I want to sort, but I only care about three of them: the title column, the date column and the status column.
In a new spreadsheet, there will be four sections. Each section corresponds to 3 months of the year (ie Jan, Feb, Mar. will map to the first column on the new spreadsheet, April, May, June will map to the second column on the new spreadsheet).
Based on the date, and if the status column has the word "Finished" (in the original spreadsheet), I want to map the title to a certain column under the new spreadsheet based on the date criteria as described in the previous paragraph. So for example, if the original spreadsheet has following:
Title   Date      Status
Doc1    1/12/13   Finished
Doc2    2/10/13   UnFinished
Doc3    4/1/13    Finished
Doc4    3/31/13   Finished
Would map to, on the new spreadsheet:
1st Column | 2nd Column
Doc1       | Doc3
Doc4       |
I have looked a lot into pivot tables but I can't "automate it" as much as I want to. I have gotten it down to the point where I can change the pivot tables into filtering based on date, but I want it even more automated than that. I've also tried excel formulas but that has been to no avail. Thanks for the help, I really appreciate it!
With a PivotTable it seems fairly easy to 'automate' as far as Sheet 2 as below:
but from there to the result requested is relatively 'manual' without VBA, so may not suit.
For my convenience I have changed the date formats. The PivotTable is constructed as usual/indicated without showing grand totals for rows or for columns (PivotTable Options, Totals & Filters). The Column Labels are Date with Grouping By Quarters with appropriate Starting at: and Ending at: (Group) and Collapse Entire Field (Expand/Collapse).
The formula in I6 is to convert the document count (always 1) to document name:
=IF(F6=1,$E6,"")
However, to allow room for additional quarters in the PivotTable the formula should be moved to the right. The formula would need to be copied across and down as necessary.
The process becomes more ‘manual’ with copying the results of these formulae, pasting them (with Special / Values) into a new location (in the example 2!A1) and, if required, deleting blanks.
This may be against the rules with regards to maintaining the integrity of the OP's request, but hopefully it doesn't offend :)
Here's another option.
Add another column (shame on me, I know) to the original data, and call it Quarter. The formula that goes next to the existing data is the following.
=IF(C2="Finished",IF(MONTH(B2)<=3,"Q1",IF(MONTH(B2)<=6,"Q2",IF(MONTH(B2)<=9,"Q3","Q4"))),C2)
Basically, if the status is "Finished", then determine in what quarter the date is.
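The logic of that formula can be traced in a short Python sketch using the four sample rows from the question. This is an illustration, not something to paste into Excel: the dates are read as month/day/year, quarter_label is a hypothetical helper name, and the status comparison here is case-sensitive, whereas Excel's = comparison on text is not.

```python
from datetime import date


def quarter_label(when, status):
    """Return Q1..Q4 for finished documents, otherwise the status itself."""
    if status != "Finished":
        return status
    return f"Q{(when.month - 1) // 3 + 1}"


# Sample rows from the question, dates assumed to be month/day/year.
docs = [
    ("Doc1", date(2013, 1, 12), "Finished"),
    ("Doc2", date(2013, 2, 10), "UnFinished"),
    ("Doc3", date(2013, 4, 1), "Finished"),
    ("Doc4", date(2013, 3, 31), "Finished"),
]

# Group finished titles by quarter, as the pivot table would after filtering.
by_quarter = {}
for title, when, status in docs:
    label = quarter_label(when, status)
    if label.startswith("Q"):
        by_quarter.setdefault(label, []).append(title)

print(by_quarter)
```

The grouping places Doc1 and Doc4 under Q1 and Doc3 under Q2, which matches the layout requested in the question.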
Create the pivot table with that data, and then add "Quarter" and "Title" to the Row Labels (in that order).
Last thing would be to click the arrow next to "Row Labels" and select "Does not Equal" under "Label Filters". There you'll type "Unfinished" (no quotation marks). This will give you something like the image below.
From here the only manual thing you'll need to do is update the data range for the pivot table if more rows are added to the pivot table data and refresh the pivot table if the original data changes
NOTE: To address your question about sorting; after you do the steps above, you can select the Row Labels again and do an A>Z sort to get each quarter to be sorted in alphabetical order

Hive: How to have a derived column that stores the sentiment value from a sentiment analysis API

Here's the scenario:
Say you have a Hive Table that stores twitter data.
Say it has 5 columns. One column being the Text Data.
Now, how do you add a 6th column that stores the sentiment value from a sentiment analysis of the Twitter text data? I plan to use a sentiment analysis API like Sentiment140 or viralheat.
I would appreciate any tips on how to implement the "derived" column in Hive.
Thanks.
Unfortunately, while the Hive API lets you add a new column to your table (using ALTER TABLE foo ADD COLUMNS (bar binary)), those new columns will be NULL and cannot be populated. The only way to add data to these columns is to clear the table's rows and load data from a new file, this new file having that new column's data.
To answer your question: You can't, in Hive. To do what you propose, you would have to have a file with 6 columns, the 6th already containing the sentiment analysis data. This could then be loaded into your HDFS, and queried using Hive.
EDIT: Just tried an example where I exported the table as a .csv after adding the new column (see above), and popped that into M$ Excel where I was able to perform functions on the table values. After adding functions, I just saved and uploaded the .csv, and rebuilt the table from it. Not sure if this is helpful to you specifically (since it's not likely that sentiment analysis can be done in Excel), but may be of use to anyone else just wanting to have computed columns in Hive.
References:
https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-DDLOperations
http://comments.gmane.org/gmane.comp.java.hadoop.hive.user/6665
You can do this in two steps without a separate table:
1) Alter the original table to add the required column.
2) Run an INSERT OVERWRITE TABLE ... SELECT that selects all columns plus your computed column from the original table back into the original table.
Caveat: This has not been tested on a clustered installation.
