Here is my measure:
CREATE MEMBER CURRENTCUBE.[Measures].[ContactNumber] AS
nonempty(
UNORDER(
(UNORDER([Contact].[Contact Id].[Contact Id].MEMBERS)
,{linkmember([Period].[Per Quarter].currentmember,[Period Ending].[Per Quarter]).NextMember : STRTOMEMBER('TAIL([Period Ending].[Per Quarter].[' + [Period].[Per Quarter].currentmember.LEVEL.name +'],1)(0)')}
,{NULL :[Period].[Per Quarter].currentmember}
,[Category].[Category].currentmember)
)
,[MAX_BeginDate]
).count
It gives me the number of customers in a category at a given period.
My fact table looks like:
contact   periodin   periodout   category
A         25         26          cat1
A         26         27          cat2
A         27         end         cat3
B         1          26          cat0
B         26         end         cat1
C         1          2           cat2
C         3          4           cat2
C         4          end         cat3
And my dimensions:
Period: regular, by periodin
Period Ending: regular, by periodout
Contact: regular, by contact
Category: regular, by category
So for the 26th, I will have:
cat0 0
cat1 1 (B)
cat2 1 (A)
cat3 1 (C)
If someone can think of an obvious improvement...
It takes over 1min-1min30 for all 4 categories on one day of 2017. There are more than 100 million rows in the fact table. Every customer has at least 1 fact. The calendar begins in 2000 and there are 60 million customers.
Thank you
Regards
Antho
Maybe this is faster:
CREATE MEMBER CURRENTCUBE.[Measures].[ContactNumber] AS
SUM(
UNORDER(
(UNORDER([Contact].[Contact Id].[Contact Id].MEMBERS)
,{linkmember([Period].[Per Quarter].currentmember,[Period Ending].[Per Quarter]).NextMember : STRTOMEMBER('TAIL([Period Ending].[Per Quarter].[' + [Period].[Per Quarter].currentmember.LEVEL.name +'],1)(0)')}
,{NULL :[Period].[Per Quarter].currentmember}
,[Category].[Category].currentmember)
),
IIF(
ISEMPTY([MAX_BeginDate])
,NULL
,1
)
)
I think linkmember is a slow function - is there any alternative you can use?
Thank you for answering,
I believe I already tried replacing the COUNT function with SUM, without good results.
But as soon as I am back at work I will try your proposition.
Yes, I can replace every LINKMEMBER by reconstructing each member from its level name and current member name with the STRTOMEMBER function. It is something I can also try.
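For reference, the reconstruction could look something like this sketch (untested; it assumes the [Period] and [Period Ending] hierarchies share level and member names, so a name-based reference resolves):
-- sketch: rebuild the matching [Period Ending] member by name instead of LINKMEMBER
STRTOMEMBER(
    '[Period Ending].[Per Quarter].['
    + [Period].[Per Quarter].CURRENTMEMBER.LEVEL.NAME + '].['
    + [Period].[Per Quarter].CURRENTMEMBER.NAME + ']'
).NEXTMEMBER
This follows the same pattern the measure already uses to build the TAIL(...) bound, so both ends of the range would then go through STRTOMEMBER.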
I also have each customer's entry period. I use it to know whether the customer is "new" at the current period. And maybe it would be possible to browse the period dimension not from the beginning but from the customer's "in" period...
I've been trying to find an answer for this but haven't succeeded: I need to sum a column for a specified date range, as long as my row name matches the reference sheet's column name.
For example:
Reference_Sheet
Date      John  Matt
07/01/19  1     2
07/02/19  1     2
07/03/19  2     1
07/04/19  1     1
07/05/19  3     3
07/06/19  1     2
07/07/19  1     1
07/08/19  5     9
07/09/19  9     2
Sheet1
     A     B
1          07/01
2          07/07
3          Week1
4    John  10
5    Matt  12
I have to work in Google Sheets. I tried SUMPRODUCT, which told me I can't multiply texts, and I tried SUMIFS, which let me know I can't have different array arguments. My failed efforts were similar to the below:
=SUMIFS('Reference_Sheet'!B2:AO1000,'Reference_Sheet'!A1:AO1,"=A4",'Reference_Sheet'!A2:A1000,">=B1",'Reference_Sheet'!A2:A1000,"<=B2")
=SUMPRODUCT(('Reference_Sheet'!$A$2:$AO$1000)*('Reference_Sheet'!$A$2:$A$1000>=B$1)*('Reference_Sheet'!$A$2:$A$1000<=B$2)*('Reference_Sheet'!$A$1:$AO$1=$A4))
This might work:
=sumifs(indirect("Reference_Sheet!"&address(2,match(A4,Reference_Sheet!A$1:AO$1,0))&":"&address(100,match(A4,Reference_Sheet!A$1:AO$1,0))),Reference_Sheet!A$2:A$100,">="&B$1,Reference_Sheet!A$2:A$100,"<="&B$2)
But you'll need to specify how many rows down you need it to go; in my formula, it looks down to row 100.
To change the number of rows, you need to change the number in three places:
&address(100
Reference_Sheet!A$2:A$100 (this range appears in two places)
To briefly explain what is going on:
look for the person's name in row 1 using MATCH
use ADDRESS and INDIRECT to build the address of the cells to add
and then SUMIFS() based on dates
Alternative:
=SUMPRODUCT(QUERY(TRANSPOSE(QUERY($A:$D,
"where A >= date '"&TEXT(F$1, "yyyy-mm-dd")&"'
and A <= date '"&TEXT(F$2, "yyyy-mm-dd")&"'", 1)),
"where Col1 = '"&$E4&"'", 0))
I'm trying to find a way to get the most recent row for each series in a measurement.
For example:
Assuming the series in the test_result measurement are:
> show series from test_result
test_result,service=MyService,team=A
test_result,service=MyService,team=B
test_result,service=MyService,team=C
and the rows in a given time frame are:
> select * from test_result order by time asc
time service team status duration
---- ------- ---- ------ --------
1523370939000000000 MyService A 1 300
1523370940000000000 MyService B 1 300
1523370941000000000 MyService A 1 300
1523370941000000000 MyService C 1 300
1523371748000000000 MyService A 1 300
1523371749000000000 MyService B 1 300
1523371750000000000 MyService B 1 300
1523371754000000000 MyService A 1 300
I would expect the query to return only the most recent row for each series: team A at 1523371754, team B at 1523371750, and team C at 1523370941.
Any help is much appreciated.
Thanks!
Thanks to Katy from Influx Staff who answered the question:
To separate the series, you can add GROUP BY *, which will give you
the results separated by series. Then you can add aggregates to your
query, like LAST. For example: SELECT LAST(field_name) FROM
test_result GROUP BY *
Keep in mind that your fields are also a factor here. You can use *
without specifying a field, but there’s room for error there. It’s
better to specify a field if you know what you need.
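Applied to the sample data above, the query would look something like this (a sketch; it assumes status and duration are fields while service and team are tags):
> SELECT LAST(status), duration FROM test_result GROUP BY *
This should return one row per series, holding the field values at each series' most recent timestamp.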
I have a table of every product purchased by every client over 25 years. The table contains client#, product, start date, and end date.
The products can be owned by the client for any amount of time (1 day to 100 years). While the client owns products with us, the client is active. If a client ends all products, they cease to be a client. I want to count new client starts each year. The problem is, some clients end all products and then start purchasing products again years later (but clients always retain the same client#). If the client leaves and then rejoins years later, I want to count the client as a new client.
I have created DAX code to do this which works perfectly on a small file, but the code uses up too many resources and so I cannot use it on my data (about 200,000 records). I know my code is HIGHLY INEFFICIENT and could probably be cleaned up, but I am not sure how. Alternatively, if I could figure out how to make these columns in Power Query, perhaps that would work.
Here is how I do it.
1) Add four calculated columns to my table:
VeryFirstStart = Calculate(
Min('Products'[StartDate]),
ALLEXCEPT(Products,Products[ClientNumber]))=Products[StartDate]
This flags the records that contain each client's first-ever start date.
MaxEndDateofEarlierDates = Calculate(
Max('Products'[EndDate]),
Filter(
Filter(ALLEXCEPT(Products, Products[ClientNumber]), Products[EndDate]),
Products[StartDate] < EARLIER(Products[StartDate])))
This step blows up my Power BI - it gives, for each record, the latest end date among that client's earlier-starting records, so I can spot NEW product purchases where the new start date occurs AFTER an ending date.
Second+Start = And(
Products[MaxEndDateofEarlierDates]<>BLANK(),
Products[MaxEndDateofEarlierDates]<Products[StartDate])
This flags records where we want to count the new start date as a new client.
NewStart = OR(Products[Second+Start],Products[VeryFirstStart])
This flags ANY new client start date, regardless of whether it was the first or a subsequent one.
Finally I added this measure:
!MemberNewStarts = CALCULATE(
DISTINCTCOUNT(Products[ClientNumber]),
FILTER(
'Products',
('Products'[StartDate] <= LASTDATE('DIMDate'[Date]) &&
'Products'[StartDate]>= FIRSTDATE('DIMDate'[Date]) &&
Products[NewStart]=TRUE())))
Does anyone have any suggestions about how to achieve this with less resources?
Thanks
Here is some data to try:
MemberNumber  Product  StartDate   EndDate
1             A        02/02/2003  02/02/2004
1             C        02/02/2009  02/02/2010
2             A        02/02/2001  02/02/2002
2             C        02/02/2001  02/02/2002
2             B        02/02/2005  02/02/2010
3             C        02/02/2002  02/02/2005
3             B        02/02/2002  02/02/2005
3             A        02/02/2003  02/02/2008
4             B        02/02/2002  02/02/2003
4             C        02/02/2003  02/02/2006
5             B        02/02/2003  02/02/2007
5             C        02/02/2005  02/02/2010
5             A        02/02/2005  02/02/2007
6             A        02/02/2001  02/02/2006
6             C        02/02/2003  02/02/2007
7             B        02/02/2001  02/02/2004
7             A        02/02/2001  02/02/2005
7             C        02/02/2005  02/02/2006
8             B        02/02/2002  02/02/2006
8             A        02/02/2004  02/02/2009
Note (not in the real data): member 1 starts as a new client in 2009, since all previous products ended in 2004; member 2 starts as a new client in 2005, since all previous products ended in 2002.
The desired outcome is:
Start Year   2001  2002  2003  2004  2005  2006  2007  2008
New Clients     3     3     2     0     1     0     0     0
Here's one way of trying to solve it. Let me know if this is any more efficient than yours:
1st New Column:
PreviousHighestFinish:=
CALCULATE(
    MAX(Products[EndDate]),
    FILTER(
        ALLEXCEPT(Products, Products[ClientNumber]),
        Products[StartDate] < EARLIER(Products[StartDate])
    )
)
This will give you the latest end date where the Client Number matches and the start date is before the current start date. If there is no earlier start date, it returns a blank.
2nd New Column:
NewClientProduct:=
if(Products[StartDate]>=Products[PreviousHighestFinish],1,0)
This will give you a 1 for every row where the client has either not been seen before (and the previous column showed blank) or the client has been seen before but has no current products.
The problem with this column is that if you have a client starting more than one product on the same date, they will show as multiple new clients.
The fix for this is to count up the instances of each client-date combination:
3rd New Column:
ClientDateCount:=
CALCULATE(
COUNTROWS(Products),
ALLEXCEPT(Products,Products[ClientNumber],Products[StartDate])
)
This essentially gives the number of times that the client on this row in the table has started a product on this date.
Now divide the 2nd new column by this one:
4th New Column:
NewClients:=
DIVIDE(Products[NewClientProduct],Products[ClientDateCount])
And voila!
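To roll the NewClients column up into the desired yearly counts, a measure as simple as this sketch should work (the name NewClientStarts is mine; it assumes you pivot by a year derived from Products[StartDate]):
// sketch: every genuinely new client contributes exactly 1 across their start-date rows
NewClientStarts:=SUM(Products[NewClients])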
I have a panel data set with multiple waves (13) and roughly 10,000 individuals each year, with people entering and exiting at various time points. I am interested in what happens as people become diagnosed with a disease over time. Therefore I need to recode the time variable so that t = 0 is the first wave when diagnosed, t = 1 is the next year, and so on (and t = -1 for the year before, etc.), so that all of my individuals are comparable. However, I am unsure how to go about this in Stata. Would anyone be able to advise? Many thanks.
The case of one diagnosis per person
clear all
set more off
*----- example data -----
set obs 100
set seed 2357
generate id = _n
generate year = floor(10 * runiform()) + 1990
expand 5
bysort id: replace year = year + _n
bysort id (year): generate diag = cond(_n == 3, 1, 0)
list in 1/20, sepby(id)
*----- what you seek -----
bysort id (diag): gen time = year - year[_N]
sort id year
list in 1/20
I assume the same data structure as @RichardHerron and use his example. diag is an indicator variable that takes on the value of 1 at the time of diagnosis and 0 otherwise (only one diagnosis per person is considered).
The sorting done by bysort is critical. The observation holding the time of diagnosis is pushed to the end of the dataset (within id groups), and then all that's left to do is compare (subtract) all years with that reference year. See help _variables for details on system variables like _N.
The case of multiple diagnoses per person
If several diagnoses are made per person, but we care only about the first occurrence (according to year), we could do:
gsort id diag -year
by id: gen time = year - year[_N]
Simple but not optimal solution
Suppose diagnosis is 1 when diagnosed (at most once per person) and 0 otherwise.
Then the time at diagnosis is at its simplest
egen time_diagnosis = total(diagnosis * year), by(id)
but you need to ignore any zeros. To spell that out,
replace time_diagnosis = . if time_diagnosis == 0
Better alternative
A more complicated but preferable alternative can handle multiple diagnoses if they occur:
egen time_diagnosis = min(year / diagnosis), by(id)
as year / diagnosis is year when diagnosis is 1 and missing otherwise. This yields missing values if there is no diagnosis, which is as it should be.
Then you subtract that to get a new time variable.
gen time2 = year - time_diagnosis
In short, I think you can get this done in two statements, handling panel structure too.
Update
@RichardHerron asks why use egen with by(), and not just
gen time_diagnosis = year * diagnosis
A limitation of that is that the "correct" value is contained only in those observations for which diagnosis is 1; that value still has to be "spread" to the other observations for the same id. But that is precisely what egen does here. In the simplest situation, with one diagnosis, the total of year * diagnosis is just year * 1, or year, as any zeros make no difference to the sum.
It is usually helpful to provide test data, but here they are easy enough to generate. The trick is to find the first year for each individual (my fyear), which I'll do with min() from egen. Then I'll subtract this first year fyear from the actual year to find the year relative to diagnosis ryear.
/* generate panel */
clear
set obs 10000
generate id = _n
generate year = floor(10 * runiform()) + 1990
expand 10
bysort id: replace year = year + _n
sort id year
list in 1/20
/* generate relative year */
bysort id: egen fyear = min(year)
generate ryear = year - fyear
list in 1/20
If the first year in the panel is not the diagnosis year, then just construct fyear based on the diagnosis criteria.
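For instance, a sketch assuming a 0/1 indicator variable named diagnosis, as in the other answers:
* sketch: take the minimum over diagnosis years only; cond() makes other years missing
bysort id: egen fyear = min(cond(diagnosis == 1, year, .))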
Edit: Thinking more on this, maybe it's the last part that you're having a hard time with (i.e., identifying the diagnosis year to subtract from the calendar year). Here's what I would do.
bysort id (year): generate diagnosis = cond(_n == 5, 1, 0)
preserve
tempfile diagnosis
keep if (diagnosis == 1)
rename year dyear
keep id dyear
save `diagnosis'
restore
merge m:1 id using `diagnosis', nogenerate
generate ryear2 = year - dyear
I'm not even sure how to word this, so an example:
I have two models:
Chicken
  id
  name
EggCounterReadings
  id
  chicken_id
  value_on_counter
  timestamp
I don't always record a count for every chicken when I do counts.
Using ActiveRecord how do I get the latest egg count per chicken?
So if I have 1 chicken and 3 counts, the counts would be 1 today, 15 tomorrow, and 18 the next day. That chicken has laid 18 eggs, not 34.
UPDATE: Found exactly what I was trying to do in MySQL: "The Rows Holding the Group-wise Maximum of a Certain Column". So I need to .find_by_sql("SELECT * FROM (SELECT * FROM EggCounterReadings WHERE <conditions> ORDER BY timestamp DESC) AS latest GROUP BY chicken_id")
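Fleshed out, that might look like the following sketch (the WHERE <conditions> from the question stays a placeholder; the derived table gets the alias MySQL requires, and the GROUP BY trick relies on MySQL's permissive pre-ONLY_FULL_GROUP_BY behavior):
# sketch: group-wise maximum via an ordered derived table
latest_counts = EggCounterReadings.find_by_sql(<<-SQL)
  SELECT * FROM (
    SELECT * FROM EggCounterReadings
    -- WHERE <conditions>
    ORDER BY timestamp DESC
  ) AS latest
  GROUP BY chicken_id
SQL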
Given your updated question, I've changed my answer.
chicken = Chicken.first
count = chicken.egg_counter_readings.order(:timestamp).last.value_on_counter
If you don't want the latest record, but the largest egg yield, then try this:
chicken = Chicken.first
count = chicken.egg_counter_readings.maximum(:value_on_counter)
I believe that should do what you want.