I'm practicing with XQuery and I've reached a point where I have to get the maximum value for each day from an XML file.
My file records the maximum temperatures a city reaches over a series of days. The same date appears several times, each time with a different hour of the day. Here is an example of how the data appear in the XML file:
<hour>
  <date>2019-2-18</date>
  <data_hour>12:00</data_hour>
  <temperature>17</temperature>
</hour>
<hour>
  <date>2019-2-18</date>
  <data_hour>13:00</data_hour>
  <temperature>18</temperature>
</hour>
<hour>
  <date>2019-2-19</date>
  <data_hour>14:00</data_hour>
  <temperature>20</temperature>
</hour>
<hour>
  <date>2019-2-19</date>
  <data_hour>15:00</data_hour>
  <temperature>19</temperature>
</hour>
The idea is to pull out the maximum for each day. Any idea how to solve this?
If XQuery 3.0 is available, you could use the group by clause to group the hour elements together for each day, then apply the max function to select the highest temperature. In this code, I assume you've bound the hour elements to a variable $hours:
for $hour in $hours
group by $date := $hour/date
return $date || ": " || max($hour/temperature)
This returns two items given your source data:
"2019-2-19: 20",
"2019-2-18: 18"
I will leave it as an exercise for you to sort the results by date. In the process, you will likely hit the limits of the current formatting of your dates, so you will need to convert the dates to values that conform to the xs:date data type, which is reliably sortable.
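As a side note, that pitfall is easy to demonstrate outside XQuery. Here is a minimal Python sketch (the sample strings are assumptions modelled on the data above) showing why the unpadded dates do not sort chronologically as plain strings, and how parsing them fixes that:

from datetime import datetime

dates = ["2019-2-19", "2019-2-18", "2019-10-1"]

# Lexicographic string sort: "2019-10-1" lands before "2019-2-18",
# because the character "1" sorts before "2".
print(sorted(dates))

# Parsing to real dates (equivalent to zero-padding to xs:date's
# YYYY-MM-DD form) restores chronological order.
print(sorted(dates, key=lambda d: datetime.strptime(d, "%Y-%m-%d")))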
If you are limited to XQuery 1.0, then see "XQuery 3.0 equivalent group by in xquery 1.0 version".
Try this; it adds an extra parent element (Day), as shown below:
let $body := doc("1.xml")
for $i in distinct-values($body//hour/date)
return
  <Day value="{data($i)}"> {
    (: collect the hour part of data_hour for every reading on this date :)
    let $hours :=
      (for $row in $body//hour
       for $j in $row/date
       where xs:string(data($j)) = xs:string(data($i))
       return fn:substring($row/data_hour, 1, 2))
    return fn:max($hours)
  } </Day>
Let me know if minutes also need to be considered; the substring above keeps only the hour part.
In my cube, I have several measures at the day grain that I'd like to sum at the day grain but average (or take latest) at the month grain or year grain.
Example:
We have a Fact table with Date and number of active subscribers in that day (aka PMC). This is snapshotted per day.
dt        SubscriberCnt
1/1/22    50
1/2/22    55
This works great at the day level. At the month level, we don't want to sum these two values (a count of 105), because that doesn't make sense and isn't accurate.
When someone is looking at the month grain, it should take the latest value for the month, like this (we may change this to an average instead; management is still deciding):
Option 1 - take latest

Month-Dt    Subscribers
Jan-2022    55
Feb-2022    -
Option 2 - take average

Month-Dt    Subscribers
Jan-2022    52
Feb-2022    -
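Purely for intuition, here is a minimal pandas sketch of the two options; the column names mirror the sample table above and the rest is an assumption:

import pandas as pd

df = pd.DataFrame({
    "dt": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "SubscriberCnt": [50, 55],
})

monthly = df.set_index("dt").groupby(pd.Grouper(freq="M"))["SubscriberCnt"]
print(monthly.last())   # option 1: latest value in the month -> 55
print(monthly.mean())   # option 2: average over the month    -> 52.5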
I've not been able to find the right search terms for this but this seems like a common problem.
I added some sample data at the end of a month for testing:
dt          SubscriberCnt
12/30/21    46
12/31/21    48
This formula uses LASTNONBLANKVALUE, which sorts by the first column and provides the latest value that is not blank:
Monthly Subscriber Count = LASTNONBLANKVALUE( 'Table'[dt], SUM('Table'[SubscriberCnt]) )
If you do an AVERAGE, a simple AVERAGE formula will work. If you want an average just for the current month, then try this:
Current Subscriber Count =
VAR _EOM = CLOSINGBALANCEMONTH( SUM('Table'[SubscriberCnt]), DateDim[Date] )
RETURN IF(_EOM <> 0, _EOM, AVERAGE('Table'[SubscriberCnt]) )
But the total row will be misleading, so I would add this so the total row is the latest number:
Current Subscriber Count =
VAR _EOM = CLOSINGBALANCEMONTH( SUM('Table'[SubscriberCnt]), DateDim[Date] ) //Get the number on the last day of the month
VAR _TOT = NOT HASONEVALUE(DateDim[MonthNo]) // Check if this is a total row (more than one month value)
RETURN IF(_TOT, [Monthly Subscriber Count], // For total rows, use the latest nonblank value
IF(_EOM <> 0, _EOM, AVERAGE('Table'[SubscriberCnt]) ) // For month rows, use final day if available, else use the average
)
I assigned the x-axis as the number of days in the year (365) and my response variable to the y-axis. However, when using seaborn, it sets my x-axis to the total number of rows I have.
[Screenshot: dataset]
sns.lineplot(x="DATE", y="TMEAN", data=TS, hue="YEAR", style="YEAR")
plt.show()
[Screenshot: lineplot result]
I tried to use pivot and melt, but to no avail.
An easy way to fix your problem is to use the "day of the year" instead of the column "DATE". You can extract that as a separate column from the date like in this example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dates = pd.date_range(start='2013-01-01', end='2021-12-31')
data = {'DATE': dates,
        'YEAR': dates.year,          # used for hue/coloring
        'DAY': dates.dayofyear,      # day of the year, 1-365 (366 in leap years)
        'TMEAN': your_temp_data}     # your temperature values go here
TS = pd.DataFrame(data=data)

sns.lineplot(data=TS, x='DAY', y='TMEAN', hue="YEAR")
plt.show()
Here, dayofyear gives you the x-axis (each day of the year, stripped of the information about which year the temperature belongs to), and the hue coloring distinguishes the years.
I have a list of products and would like to get a 50 day simple moving average of its volume using Power Query (M).
The table is sorted by product name and date. I add a custom column and applied the code below.
if [date] >= #date(2018,1,29)
then List.Average(List.Range(Source[Volume],[Volume]-1,-50))
else ""
Since it is already sorted by date and name, an if statement was applied with a date as the criterion/filter. However, an error occurs that says:
'Volume' column not found in the table.
I expect to get an added column in Power Query with the 50-day moving average of volume per product, with the calculation done only if the date is greater than or equal to Jan 29, 2018.
We don't know what your columns are, but assuming you have [product], [date] and [volume] in Source, this averages the last 50 days of [volume] for the identical [product] relative to each row's [date], and places the result in a new column:
AvgAmountAdded = Table.AddColumn(
    Source,
    "AverageAmount",
    (i) => List.Average(
        Table.SelectRows(
            Source,
            each [product] = i[product]
                and [date] <= i[date]
                and [date] >= Date.AddDays(i[date], -50)
        )[volume]
    ),
    type number
)
Finally! I found a solution.
First, apply an index by product (see this post for further details).
Then index again without criteria (index all rows).
Then apply the code below:
= Table.AddColumn(
    #"Previous Step",
    "Volume SMA(50)",
    each if [Index_byProduct] >= 50
        then List.Average(List.Range(#"Previous Step"[Volume], [Index_All] - 50, 50))
        else 0
)
For large datasets, the Table.Buffer function is recommended after the index-expand step to improve PQ calculation speed.
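As a hedged aside for anyone comparing tools: the same per-product 50-row moving average is short in pandas. The column names (product, date, volume) follow the question; the sample frame is an assumption:

import pandas as pd

# Tiny sample frame; real data would have 50+ rows per product.
df = pd.DataFrame({
    "product": ["A"] * 60,
    "date": pd.date_range("2018-01-01", periods=60),
    "volume": range(60),
})

df = df.sort_values(["product", "date"])
# rolling(50) yields NaN for the first 49 rows of each product,
# analogous to the else-branch above.
df["Volume SMA(50)"] = (
    df.groupby("product")["volume"]
      .transform(lambda s: s.rolling(window=50).mean())
)
print(df.tail())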
I am a newbie to Power BI and DAX.
I have a dataset as attached. I need to find the maximum value for each person for each week. I have written the formula in Excel.
=MAX(IF(A$2:A$32=A2,IF(D$2:D$32=D2,IF(B$2:B$32=1,C$2:C$32))))
How can I convert it to DAX or write the same formula in Power BI? I tried the DAX code below, but it did not work (the ALLEXCEPT function expects a table).
Weekly Maximum =
CALCULATE ( MAX ( PT[Value] ),
    ALLEXCEPT ( PT, PT[person], PT[Week], PT[category] == 1 ) )
Once I calculate this, I then need to calculate the expected value for each week, which is the previous week's maximum value * 2.85, as shown in the screenshot. How can I bring the previous week's maximum value into the current week?
Any corrections/solutions, please?
TIA
The Max Value for Category 1 can be written like this:
= CALCULATE(MAX(PT[Value]),
ALLEXCEPT(PT, PT[Person], PT[Week]),
PT[Category] = 1)
(The Category filter doesn't go inside ALLEXCEPT().)
For your Expected Value column, you can do something similar:
= CALCULATE(2.85 * MAX(PT[Value]),
ALLEXCEPT(PT, PT[Person]),
PT[Category] = 1,
PT[Week] = EARLIER(PT[Week]) - 1)
(The EARLIER function gives you the value for the row you are in. The name refers to the earlier row context.)
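If it helps to see the logic outside DAX, here is a rough pandas equivalent of both columns. The table and column names (PT, Person, Week, Category, Value) follow the question; the sample values and the assumption of consecutive week numbers are mine:

import pandas as pd

PT = pd.DataFrame({
    "Person":   ["A", "A", "A", "A"],
    "Week":     [1, 1, 2, 2],
    "Category": [1, 2, 1, 1],
    "Value":    [10, 99, 12, 11],
})

# Max of Value per Person and Week, counting only Category 1 rows.
weekly_max = (PT[PT["Category"] == 1]
              .groupby(["Person", "Week"])["Value"].max()
              .rename("WeeklyMax"))

# Expected value: previous week's max * 2.85 (the EARLIER pattern).
# shift(1) assumes week numbers are consecutive within each person.
expected = (weekly_max.groupby("Person").shift(1) * 2.85).rename("Expected")

print(pd.concat([weekly_max, expected], axis=1))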
I am having trouble understanding why a for loop construction does not work. I am not really used to for loops, so I apologize if I am missing something basic. Anyhow, I appreciate any piece of advice you might have.
I am using a party-level dataset from the parlgov project. I am trying to create a variable which captures how many times a party has been in government before the current observation. Time is important: the counter should be zero if a party has not been in government before, even if it entered government multiple times after the observation period. Parties are nested in countries and in cabinet dates.
The code is as follows:
use "http://eborbath.github.io/stackoverflow/loop.dta", clear //to get the data
If this does not work, I have also uploaded the data in CSV format; try:
import delimited "http://eborbath.github.io/stackoverflow/loop.csv", bindquote(strict) encoding(UTF-8) clear
The loop should go through each country-specific cabinet date, identify the previous observation and check if the party has already been in government. This is how far I have got:
gen date2=cab_date
gen gov_counter=0
levelsof country, local(countries) // to get to the unique values in countries
foreach c of local countries{
preserve // I think I need this to "re-map" the unique cabinet dates in each country
keep if country==`c'
levelsof cab_date, local(dates) // to get to the unique cabinet dates in individual countries
restore
foreach i of local dates {
egen min_date=min(date2) // this is to identify the previous cabinet date
sort country party_id date2
bysort country party_id: replace gov_counter=gov_counter+1 if date2==min_date & cabinet_party[_n-1]==1 // this should be the counter
bysort country: replace date2=. if date2==min_date // this is to drop the observation which was counted
drop min_date //before I restart the nested loop, so that it again gets to the minimum value in `dates'
}
}
The code works without an error, but it does not do the job. Evidently there's a mistake somewhere, I am just not sure where.
BTW, it's a specific application of a problem I super often encounter: how do you count frequencies of distinct values in a multilevel data structure? This is slightly more specific, to the extent that "time matters", and it should not just sum all encounters. Let me know if you have an easier solution for this.
Thanks!
The problem with your loop is that it does not keep the replaced gov_counter after the loop. However, there is a much easier solution I'd recommend:
sort country party_id cab_date
by country party_id: gen gov_counter=sum(cabinet_party[_n-1])
This sorts the data into groups and then creates a sum by group, always up to (but not including) the current observation.
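For readers more at home in Python, a hedged sketch of the same idea (a running sum of the lagged indicator within each group); the column names follow the Stata code and the sample values are made up:

import pandas as pd

df = pd.DataFrame({
    "country":       ["X"] * 4,
    "party_id":      [1, 1, 1, 1],
    "cab_date":      [1, 2, 3, 4],
    "cabinet_party": [0, 1, 1, 0],
})

df = df.sort_values(["country", "party_id", "cab_date"])
# Lag the in-government flag within each group, then take a running
# sum: each row counts earlier spells only, never itself.
df["gov_counter"] = (
    df.groupby(["country", "party_id"])["cabinet_party"]
      .transform(lambda s: s.shift(1, fill_value=0).cumsum())
)
print(df)  # gov_counter: 0, 0, 1, 2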
I would start here. I have stripped the comments so that we can look at the code. I have made some tiny cosmetic alterations.
foreach i of local dates {
egen min_date = min(date2)
sort country party_id date2
bysort country party_id: replace gov_counter=gov_counter+1 ///
if date2 == min_date & cabinet_party[_n-1] == 1
bysort country: replace date2 = . if date2 == min_date
drop min_date
}
This loop includes no reference to the loop index i defined in the foreach statement. So, the code is the same and completely unaffected by the loop index. The variable min_date is just a constant for the dataset and the same each time around the loop. What does depend on how many times the loop is executed is how many times the counter is incremented.
The fallacy here appears to be a false analogy with constructs in other software, in which a loop automatically spawns separate calculations for different values of a loop index.
It's not illegal for loop contents never to refer to the loop index, as is easy to see:
forval j = 1/3 {
di "Hurray"
}
produces
Hurray
Hurray
Hurray
But if you want different calculations for different values of the loop index, that has to be explicit.
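The same point holds in any language; here is a throwaway Python sketch of the contrast:

values = [10, 20, 30]

# The body never mentions j, so every pass repeats the identical work.
for j in range(3):
    result = sum(values)        # same answer on every iteration

# To get a different calculation per pass, the body must use j explicitly.
for j in range(3):
    print(j, sum(values[: j + 1]))   # now the result depends on j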