DAX / Power Pivot - Count occurrence substring within string of another table without relation - dax

Dears, i'm stuck in a difficult yet simple task.
I need to count the number of occurence of speficic substrings within a string in a column that as no relation.
Let me give you an example with Emojis :
Table 1:
People
Message
John
You got it 😊
Paul
So Beautifull 😍😍😍
John
😊
Paul
Good luck 😚
Table 2 :
Subgroup
Emoticones
face-smiling
😁
face-smiling
😂
face-smiling
😊
face-smiling
😇
face-affection
😍
face-affection
😚
In a pivot, i would like to be able to do this kind of thing with Puivot :
Subgroup
Total emoji
face-smiling
2
face-affection
4
Person
face-smiling
face-affection
John
2
0
Paul
0
4

I would create a calculated table using GENERATE() to iterate over each Emoticone
Then use ADDCOLUMNS() to iterate over each Messages, and add the count of the current Emoticone in the current Message.
To count the number of occurrences of an icon in a Message, compute the difference of lengths between Message and Message with removed icons (an icon counts for 2 characters)
And finally FILTER() out pairs without match:
EmotiCounts =
VAR cj = GENERATE(Emoticones,
VAR icon = Emoticones[Emoticones]
RETURN
ADDCOLUMNS(
Messages,
"Counts",
(
LEN(Messages[Message])-
LEN(
SUBSTITUTE(
Messages[Message],
icon,
""))
)/2
)
)
RETURN
FILTER( cj, [Counts] > 0 )
That table contains both People and Subgroup needed for your visuals.

Related

Expressions that yield variant data-type cannot be used to define calculated columns

Team,
I'm trying to get the last 12 months of financials, using the days as my base, and am getting the the expression yield error. I think it's because they aren't text, but calculated columns.
What is best to add to this? I tried 'value' and that didn't work. I think the isnonblank may work.. or is blank? As there will be blanks returned.
Here's the formula:
Last 12 Mn Amt = (if('LSI DP_JP_HistorySummaryBySiteMaster'[Z - Days]>=-365,'LSI DP_JP_HistorySummaryBySiteMaster'[Z Total Revenue + Rent Revenue],""))
The problem is that the TRUE result returns a different data type than the FALSE result.
Try replacing "" with blank or zero.
Last 12 Mn Amt =
IF (
'LSI DP_JP_HistorySummaryBySiteMaster'[Z - Days] >= -365,
'LSI DP_JP_HistorySummaryBySiteMaster'[Z Total Revenue + Rent Revenue],
BLANK ()
)
Note: Omitting the 3rd argument entirely returns BLANK() as the default behavior.

How do you create a new column based on Max value of 1 column and Category of another?

I am working with a bunch of data for my job creating status reports on the documents that we are working through that we then assign to an area. We decided to use PowerBI as an interactive way to see where everything is at.
Using Power BI Desktop I've created a new table that excludes documents that are not ready for QC but we have several different statuses. Instead of creating a new table for each status type (since some can be grouped together) I would like to create a new column that has the grouped status value's Max for each area. The higher the Status Value the further it is from being complete.
EX:
Record:
Area:
Status Value:
Max Status Value:
152385
A
1
2
354354
B
2
3
131322
B
3
3
132136
A
2
2
213513
A
1
2
351315
B
2
3
If anyone knows how to get the Max Status Value column that would greatly help. I did find another post (https://community.powerbi.com/t5/Desktop/LOOKUPVALUE-return-min-max-of-values-found/td-p/657534) that was similar but I'm still new to DAX and could not figure out how to apply it to my situation.
This post actually helped me answer the question.
https://community.powerbi.com/t5/Power-Query/Maxifs-Power-Query/m-p/1693606
The only difference I made was getting rid of the true/false portion to receive my results. Thus my result was:
Max Status Value =
VAR vMaxVal=
CALCULATE (
MAX ( 'Table'[Status Value] ),
ALLEXCEPT (
'Table',
'Table'[Area]
)
)
RETURN
vMaxVal

SUMIF with date range for specific column

I've been trying to find an answer for this, but haven't succeeded - I need to sum a column for a specified date range, as long as my rowname matches the reference sheet's column name.
i.e
Reference_Sheet
Date John Matt
07/01/19 1 2
07/02/19 1 2
07/03/19 2 1
07/04/19 1 1
07/05/19 3 3
07/06/19 1 2
07/07/19 1 1
07/08/19 5 9
07/09/19 9 2
Sheet1
A B
1 07/01
2 07/07
3 Week1
4 John 10
5 Matt 12
Have to work in google sheets, and I tried using SUMPRODUCT which told me I can't multiply texts and I tried SUMIFS which let me know I can't have different array arguments - failed efforts were similar to below,
=SUMIFS('Reference_Sheet'!B2:AO1000,'Reference_Sheet'!A1:AO1,"=A4",'Reference_Sheet'!A2:A1000,">=B1",'Reference_Sheet'!A2:A1000,"<=B2")
=SUMPRODUCT(('Reference_Sheet'!$A$2:$AO$1000)*('Reference_Sheet'!$A$2:$A$1000>=B$1)*('Reference_Sheet'!$A$2:$A$1000<=B$2)*('Reference_Sheet'!$A$1:$AO$1=$A4))
This might work:
=sumifs(indirect("Reference_Sheet!"&address(2,match(A4,Reference_Sheet!A$1:AO$1,0))&":"&address(100,match(A4,Reference_Sheet!A$1:AO$1,0))),Reference_Sheet!A$2:A$100,">="&B$1,Reference_Sheet!A$2:A$100,"<="&B$2)
But you'll need to specify how many rows down you need it to go. In my formula, it looks down till 100 rows.
To change the number of rows, you need to change the number in three places:
&address(100
Reference_Sheet!A$2:A$100," ... in two places
To briefly explain what is going on:
look for the person's name in row 1 using match
Use address and indirect to build the address of cells to add
and then sumIfs() based on dates.
alternative:
=SUMPRODUCT(QUERY(TRANSPOSE(QUERY($A:$D,
"where A >= date '"&TEXT(F$1, "yyyy-mm-dd")&"'
and A <= date '"&TEXT(F$2, "yyyy-mm-dd")&"'", 1)),
"where Col1 = '"&$E4&"'", 0))

How to find the maximum value from a column satisfying two or more IF conditions in DAX

I am a newbie to Power BI and DAX.
I have a dataset as attached. I need to find the maximum value for each person for each week. I have written the formula in Excel.
=MAX(IF(A$2:A$32=A2,IF(D$2:D$32=D2,IF(B$2:B$32=1,C$2:C$32))))
How can I convert it to DAX or write the same formula in Power BI? I tried the DAX Code as below, But it did not work(ALLEXCEPT Function expects table).
Weekly Maximum =
CALCULATE ( MAX ( PT[Value] ), ALLEXCEPT ( PT, PT[person], PT[Week],
PT[category] ==1 ) )
Once I calculate this, then I need to calculate the Expected value for each week, that has the maximum value of the previous week * 2.85, as shown in the screenshot. How can I put the previous week's maximum value for this week?
Any corrections/solutions, please?
TIA
The Max Value for Category 1 can be written like this:
= CALCULATE(MAX(PT[Value]),
ALLEXCEPT(PT, PT[Person], PT[Week]),
PT[Category] = 1)
(The Category filter doesn't go inside ALLEXCEPT().)
For your Expected Value column, you can do something similar:
= CALCULATE(2.85 * MAX(PT[Value]),
ALLEXCEPT(PT, PT[Person]),
PT[Category] = 1,
PT[Week] = EARLIER(PT[Week]) - 1)
(The EARLIER function gives you the value for the row you are in. The name refers to the earlier row context.)

Time aligned variable in stata

I have a panel data set for multiple waves (13) for roughly 10,000 individuals each year, with people entering and exiting at various time points. I am interested in what happens as people become diagnosed with a disease over time. Therefore I need to recode the time variable so that it becomes t=0 the first wave when diagnosed, then t=1 is the next year and so on, so that all of my individuals are comparable (and I guess -1 for t-1 etc). However I am unsure about how to go about this in stata. Would anyone be able to advise? Many thanks
The case of one diagnosis per person
clear all
set more off
*----- example data -----
set obs 100
set seed 2357
generate id = _n
generate year = floor(10 * runiform()) + 1990
expand 5
bysort id: replace year = year + _n
bysort id (year): generate diag = cond(_n == 3, 1, 0)
list in 1/20, sepby(id)
*----- what you seek -----
bysort id (diag): gen time = year - year[_N]
sort id year
list in 1/20
I assume the same data structure as #RichardHerron and use his example. diag is an indicator variable that takes on the value of 1 at the time of diagnosis and 0 otherwise (only one diagnosis per person is considered).
The sorting done by bysort is critical. The observation holding the time of diagnosis is pushed to the end of the database (by id groups) and then all that's left to do is compare (subtract) all years with that reference year. See help _variables for details on system variables like _N.
The case of multiple diagnoses per person
If several diagnoses are made per person, but we care only for the first occurence (according to year), we could do:
gsort id diag -year
by id: gen time = year - year[_N]
Simple but not optimal solution
Suppose diagnosis is 1 when diagnosed (at most once per person) and 0 otherwise.
Then the time at diagnosis is at its simplest
egen time_diagnosis = total(diagnosis * year), by(id)
but you need to ignore any zeros. To spell that out,
replace time_diagnosis = . if time_diagnosis == 0
Better alternative
A more complicated but preferable alternative can handle multiple diagnoses if they occur:
egen time_diagnosis = min(year / diagnosis), by(id)
as year / diagnosis is year when diagnosis is 1 and missing otherwise. This yields missing values if there is no diagnosis, which is as it should be.
Then you subtract that to get a new time variable.
gen time2 = time - time_diagnosis
In short, I think you can get this done in two statements, handling panel structure too.
Update
#Richard Herron asks why use egen with by(), and not just
gen time_diagnosis = time * diagnosis
A limitation of that is that the "correct" value is contained only in those observations for which diagnosis is 1; that value still has to be "spread" to other values for the same id. But that is precisely what egen does here. In the simplest situation, with one diagnosis the total of time * diagnosis is just time * 1 or time, as any zeros make no difference to the sum.
It is usually helpful to provide test data, but here they are easy enough to generate. The trick is to find the first year for each individual (my fyear), which I'll do with min() from egen. Then I'll subtract this first year fyear from the actual year to find the year relative to diagnosis ryear.
/* generate panel */
clear
set obs 10000
generate id = _n
generate year = floor(10 * runiform()) + 1990
expand 10
bysort id: replace year = year + _n
sort id year
list in 1/20
/* generate relative year */
bysort id: egen fyear = min(year)
generate ryear = year - fyear
list in 1/20
If the first year in the panel is not diagnosis, then just construct fyear based on diagnosis criteria.
Edit: Thinking more on this, maybe it's the last part that you're having a hard time with (i.e., identifying the diagnosis year to subtract from the calendar year). Here's what I would do.
bysort id (year): generate diagnosis = cond(_n == 5, 1, 0)
preserve
tempfile diagnosis
keep if (diagnosis == 1)
rename year dyear
keep id dyear
save `diagnosis'
restore
merge m:1 id using `diagnosis', nogenerate
generate ryear2 = year - dyear

Resources