Fill time gaps with Power Query

I have the following data:
+-----------+-----------+-----------+
| start     | stop      | status    |
+-----------+-----------+-----------+
| 09:01:10  | 09:01:40  | active    |
| 09:02:30  | 09:04:50  | active    |
| 09:10:01  | 09:11:50  | active    |
+-----------+-----------+-----------+
I want to fill in the gaps with "passive"
+-----------+-----------+-----------+
| start     | stop      | status    |
+-----------+-----------+-----------+
| 09:01:10  | 09:01:40  | active    |
| 09:01:40  | 09:02:30  | passive   |
| 09:02:30  | 09:04:50  | active    |
| 09:04:50  | 09:10:01  | passive   |
| 09:10:01  | 09:11:50  | active    |
+-----------+-----------+-----------+
How can I do this in M Query language?

You could try something like the below (my first two steps, someTable and changedTypes, are just to re-create your sample data on my end):
let
    someTable = Table.FromColumns(
        {
            {"09:01:10", "09:02:30", "09:10:01"},
            {"09:01:40", "09:04:50", "09:11:50"},
            {"active", "active", "active"}
        },
        {"start", "stop", "status"}
    ),
    changedTypes = Table.TransformColumnTypes(someTable, {{"start", type duration}, {"stop", type duration}, {"status", type text}}),
    listOfRecords = Table.ToRecords(changedTypes),
    transformList = List.Accumulate(
        List.Skip(List.Positions(listOfRecords)),
        {listOfRecords{0}},
        (listState, currentIndex) =>
            let
                previousRecord = listOfRecords{currentIndex - 1},
                currentRecord = listOfRecords{currentIndex},
                thereIsAGap = currentRecord[start] <> previousRecord[stop],
                recordsToAdd = if thereIsAGap
                    then {[start = previousRecord[stop], stop = currentRecord[start], status = "passive"], currentRecord}
                    else {currentRecord},
                append = listState & recordsToAdd
            in
                append
    ),
    backToTable = Table.FromRecords(transformList, type table [start = duration, stop = duration, status = text])
in
    backToTable
This is what I start off with (at the changedTypes step), and this is what I end up with (screenshots not reproduced here).
To integrate with your existing M code, you'll probably need to:
remove someTable and changedTypes from my code (and replace with your existing query)
change changedTypes in the listOfRecords step to whatever your last step is called (otherwise you'll get an error if you don't have a changedTypes expression in your code); see the sketch below.
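For example, if your existing query's last step were called PreviousStep (a name I've made up), the hook-in would look roughly like this, shown here with a stand-in table so the snippet runs on its own:
let
    // stand-in for the final step of your existing query (hypothetical data)
    PreviousStep = #table(
        type table [start = duration, stop = duration, status = text],
        {
            {#duration(0, 9, 1, 10), #duration(0, 9, 1, 40), "active"},
            {#duration(0, 9, 2, 30), #duration(0, 9, 4, 50), "active"}
        }
    ),
    // point the rest of the code at your own last step instead of changedTypes
    listOfRecords = Table.ToRecords(PreviousStep)
in
    listOfRecords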
Edit:
Further to my answer, what I would suggest is:
Try changing this line in the code above:
listOfRecords = Table.ToRecords(changedTypes),
to
listOfRecords = List.Buffer(Table.ToRecords(changedTypes)),
I found that storing the list in memory reduced my refresh time significantly (roughly 90%). I imagine there are limits and drawbacks (e.g. if the list can't fit in memory), but it might be okay for your use case.
Do you experience similar behaviour? Also, my basic timing graph unfortunately suggests the code overall has non-linear complexity.
Final note: I found that generating and processing 100k rows resulted in a stack overflow whilst refreshing the query (this might have been due to the generation of the input rows rather than the insertion of new rows, I don't know). So clearly, this approach has limits.

I think I may have a better performing solution.
From your source table (assuming it's sorted), add an index column starting from 0 and an index column starting from 1, then merge the table with itself (a left outer join on the two index columns) and expand the start column.
Remove columns except for stop, status, and start.1 and filter out nulls.
Rename columns to start, status, and stop and replace "active" with "passive".
Finally, append this table to your original table.
let
    Source = Table.RenameColumns(#"Removed Columns", {{"Column1.2", "start"}, {"Column1.3", "stop"}, {"Column1.4", "status"}}),
    Add1Index = Table.AddIndexColumn(Source, "Index", 1, 1),
    Add0Index = Table.AddIndexColumn(Add1Index, "Index.1", 0, 1),
    SelfMerge = Table.NestedJoin(Add0Index, {"Index"}, Add0Index, {"Index.1"}, "Added Index1", JoinKind.LeftOuter),
    ExpandStart1 = Table.ExpandTableColumn(SelfMerge, "Added Index1", {"start"}, {"start.1"}),
    RemoveCols = Table.RemoveColumns(ExpandStart1, {"start", "Index", "Index.1"}),
    FilterNulls = Table.SelectRows(RemoveCols, each ([start.1] <> null)),
    RenameCols = Table.RenameColumns(FilterNulls, {{"stop", "start"}, {"start.1", "stop"}}),
    ActiveToPassive = Table.ReplaceValue(RenameCols, "active", "passive", Replacer.ReplaceText, {"status"}),
    AppendQuery = Table.Combine({Source, ActiveToPassive}),
    #"Sorted Rows" = Table.Sort(AppendQuery, {{"start", Order.Ascending}})
in
    #"Sorted Rows"
This should be O(n) complexity with similar logic to @chillin's answer, but I think it should be faster than using a custom function since it relies on a built-in merge, which is likely to be highly optimized.

I would approach this as follows:
Duplicate the first table.
Replace "active" with "passive".
Remove the start column.
Rename stop to start.
Create a new stop column by looking up the earliest start time from your original table that occurs after the current stop time.
Filter out nulls in this new column.
Append this table to the original table.
The M code will look something like this:
let
    Source = <...your starting table...>,
    PassiveStatus = Table.ReplaceValue(Source, "active", "passive", Replacer.ReplaceText, {"status"}),
    RemoveStart = Table.RemoveColumns(PassiveStatus, {"start"}),
    RenameStart = Table.RenameColumns(RemoveStart, {{"stop", "start"}}),
    AddStop = Table.AddColumn(RenameStart, "stop", (C) => List.Min(List.Select(Source[start], each _ > C[start])), type time),
    RemoveNulls = Table.SelectRows(AddStop, each ([stop] <> null)),
    CombineTables = Table.Combine({Source, RemoveNulls}),
    #"Sorted Rows" = Table.Sort(CombineTables, {{"start", Order.Ascending}})
in
    #"Sorted Rows"
The only tricky bit above is the custom column part where I define the new column like this:
(C) => List.Min(List.Select(Source[start], each _ > C[start]))
This takes each item in the column/list Source[start] and compares it to the time in the current row. It selects only the ones that occur after the time in the current row and then takes the min over that list to find the earliest one.
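One possible tweak, echoing the List.Buffer suggestion in the first answer: buffer Source[start] once so that List.Select does not re-read the column for every row. A sketch (bufferedStarts is a name I've made up) that would replace the AddStop step above:
    bufferedStarts = List.Buffer(Source[start]),
    AddStop = Table.AddColumn(RenameStart, "stop", (C) => List.Min(List.Select(bufferedStarts, each _ > C[start])), type time),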

Related

Issue with KQL / Kusto distinct, extend and project

I don't really understand what the issue is when you want to use distinct and distinct count AFTER a project or extend operator.
let planting_table = ['events.all']
| where FullName_Name == "plant_seed"
| extend SeedName = EventData.Payload.SeedName,
OasisName = EventData.Payload.OasisName,
TileName = EventData.Payload.TileName
| project-away SchemaVersion, FullName_Namespace, Entity_Id, Entity_Type,
EntityLineage_title, EventData, EntityLineage_title_player_account,
EntityLineage_namespace, ExperimentVariants, FullName_Name
| project-rename id = EntityLineage_master_player_account
;
planting_table
| summarize dcount(SeedName) by id
My goal is to make a distinct count of the seedname by ID in Kusto / KQL. How can I do that?
Why can't I use distinct after an extend or a project operator?
Thanks for the help!
Try casting the dynamic property named SeedName to string using the tostring() function, so that you can aggregate over it using the distinct operator.
i.e.
...
| extend SeedName = tostring(EventData.Payload.SeedName),
...
| summarize dcount(SeedName) by id
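If you'd rather have an exact figure than dcount's approximation, one alternative (a sketch, assuming SeedName has already been cast to string inside planting_table as above) is to deduplicate first and then count:
planting_table
| distinct id, SeedName
| summarize count() by id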

How do I strip rows inside a column table, based on the "outer" table's value, in Power Query?

Being pretty new to Power Query, I find myself faced with this problem I wish to solve.
I have a TableA with these columns. Example:
Key | Sprint | Index
-------------------------
A | PI1-I1 | 1
A | PI1-I2 | 2
B | PI1-I3 | 1
C | PI1-I1 | 1
I want to end up with a set looking like this:
Key | Sprint | Index | HasSpillOver
-------------------------
A | PI1-I1 | 1 | Yes
A | PI2-I2 | 2 | No
B | PI1-I3 | 1 | No
C | PI1-I1 | 1 | No
I thought I could maybe NestedJoin TableA on itself, then compare indices and strip them away, and then count rows in the table, as outlined below.
TableA=Key, Sprint, Index
// TableA Nested joined on itself (Key, Sprint, Index, Nested)
TableB=NestedJoin(#"TableA", "Key", #"TableA", "Key", "Nested", JoinKind.Inner)
TableC= Table.TransformColumns(#"TableB", {"Nested", (x)=>Table.SelectRows(x, each [Index] <x[Index])} )
.. and then do the count, however this throws an error:
Can not apply operator < on types List and Number.
Any suggestions how to approach this problem? Possibly (probably) in a different way.
You did not define very well what "spillover" means, but this should get you most of the way.
Mine assumes adding another index. You could use what you have if it is relevant.
Then the code counts the number of rows where the (2nd) index is higher and the [Key] field matches. You could add code so that the Sprint field matches as well, if relevant (see the sketch after the code below).
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index.1", 0, 1),
    #"Added Custom" = Table.AddColumn(#"Added Index", "Count", (i) => Table.RowCount(Table.SelectRows(#"Added Index", each [Key] = i[Key] and [Index.1] > i[Index.1])))
in
    #"Added Custom"

Avoid multiple sums in custom crossfilter reduce functions

This question arises from some difficulties in creating a crossfilter dataset, in particular how to group the different dimensions and compute derived values. The final aim is to have a number of dc.js graphs using the dimensions and groups.
(Fiddle example https://jsfiddle.net/raino01r/0vjtqsjL/)
Question
Before going on with the explanation of the setting, the key question is the following:
How do I create custom add, remove, and init functions to pass to .reduce so that the first two do not sum the same feature multiple times?
Data
Let's say I want to monitor the failure rate of a number of machines (just an example). I do this using different dimensions: month, machine location, and type of failure.
For example I have the data in the following form:
| month | room | failureType | failCount | machineCount |
|---------|------|-------------|-----------|--------------|
| 2015-01 | 1 | A | 10 | 5 |
| 2015-01 | 1 | B | 2 | 5 |
| 2015-01 | 2 | A | 0 | 3 |
| 2015-01 | 2 | B | 1 | 3 |
| 2015-02 | . | . | . | . |
Expected
For the three given dimensions, I should have:
month_1_rate = $\frac{10+2+0+1}{5+3}$;
room_1_rate = $\frac{10+2}{5}$;
type_A_rate = $\frac{10+0}{5+3}$.
Idea
Essentially, what counts in this setting is the pair (month, room). I.e. given a month and a room there should be a rate attached to them (then the crossfilter should act to take into account the other filters).
Therefore, a way to go could be to store the pairs that have already been used and not sum machineCount for them - however, we still want to update the failCount value.
Attempt (failing)
My attempt was to create custom reduce functions and not sum machineCount values that were already taken into account.
However there are some unexpected behaviours. I'm sure this is not the way to go - so I hope to have some suggestion on this.
// A dimension is one of:
// ndx = crossfilter(data);
// ndx.dimension(function(d){ return d.month; })
// ndx.dimension(function(d){ return d.room; })
// ndx.dimension(function(d){ return d.failureType; })

// Goal: have a general way to get the group given the dimension:
function get_group(dim){
    return dim.group().reduce(add_rate, remove_rate, initial_rate);
}

// month is given as a datetime object
var monthNameFormat = d3.time.format("%Y-%m");

function check_done(p, v){
    return p.done.indexOf(v.room + '_' + monthNameFormat(v.month)) == -1;
}

// The three functions needed for the custom `.reduce` block.
function add_rate(p, v){
    var index = check_done(p, v);
    if (index) p.done.push(v.room + '_' + monthNameFormat(v.month));
    var count_to_sum = (index) ? v.machineCount : 0;
    p.mach_count += count_to_sum;
    p.fail_count += v.failCount;
    p.rate = (p.mach_count == 0) ? 0 : p.fail_count * 1000 / p.mach_count;
    return p;
}

function remove_rate(p, v){
    var index = check_done(p, v);
    var count_to_subtract = (index) ? v.machineCount : 0;
    if (index) p.done.push(v.room + '_' + monthNameFormat(v.month));
    p.mach_count -= count_to_subtract;
    p.fail_count -= v.failCount;
    p.rate = (p.mach_count == 0) ? 0 : p.fail_count * 1000 / p.mach_count;
    return p;
}

function initial_rate(){
    return {rate: 0, mach_count: 0, fail_count: 0, done: new Array()};
}
Connection with dc.js
As mentioned, the previous code is needed to create dimension, group to be passed in three different bar graphs using dc.js.
Each graph will have .valueAccessor(function(d){ return d.value.rate; }).
See the jsfiddle (https://jsfiddle.net/raino01r/0vjtqsjL/) for an implementation. Different numbers, but the data structure is the same. Notice that in the fiddle you would expect a machine count of 18 (in both months), however you always get double that (because of the 2 different locations).
Edit
Reductio + dc.js
Following Ethan Jewett's answer, I used Reductio to take care of the grouping. The updated fiddle is here: https://jsfiddle.net/raino01r/dpa3vv69/
My reducer object needs two exceptions (month, room) when summing the machineCount values. Hence it is built as follows:
var reducer = reductio();
reducer.value('mach_count')
    .exception(function(d) { return d.room; })
    .exception(function(d) { return d.month; })
    .exceptionSum(function(d) { return d.machineCount; });
reducer.value('fail_count')
    .sum(function(d) { return d.failCount; });
This seems to fix the numbers when the graphs are rendered.
However, I do have a strange behaviour when filtering one single month and looking at the numbers in the type graph.
Possible solution
Rather than creating two exceptions, I could merge the two fields when processing the data. I.e. as soon as the data is defined I could:
data.forEach(function(x){
    x['room_month'] = x['room'] + '_' + x['month'];
});
Then the above reduction code should become:
var reducer = reductio();
reducer.value('mach_count')
    .exception(function(d) { return d.room_month; })
    .exceptionSum(function(d) { return d.machineCount; });
reducer.value('fail_count')
    .sum(function(d) { return d.failCount; });
This solution seems to work. However, I am not sure if this is a sensible thing to do: if the dataset is large, adding a new feature could slow things down quite a lot!
A few things:
Don't calculate rates in your Crossfilter reducers. Calculate the components of the rates. This will keep both simpler and faster. Do the actual division in your value accessor.
You've basically got the right idea. I think there are two problems that I see immediately:
In your remove_rate you are not removing the key from the p.done array. You should be doing something like if (index) p.done.splice(p.done.indexOf(v.room+'_'+monthNameFormat(v.month)), 1); to remove it.
In your reduce functions, index is a boolean. (index == -1) will never evaluate to true, IIRC. So your added machine count will always be 0. Use var count_to_sum = index ? v.machineCount:0; instead.
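A rough sketch of the reducers with both fixes applied, and with the rate moved out of the reducer as suggested in point 1 (so the charts would use something like .valueAccessor(function(d){ return d.value.mach_count ? d.value.fail_count / d.value.mach_count : 0; })):
function groupKey(v) { return v.room + '_' + monthNameFormat(v.month); }

function add_rate(p, v) {
    var isNew = p.done.indexOf(groupKey(v)) === -1;   // this (room, month) pair not counted yet
    if (isNew) {
        p.done.push(groupKey(v));
        p.mach_count += v.machineCount;
    }
    p.fail_count += v.failCount;
    return p;
}

function remove_rate(p, v) {
    var idx = p.done.indexOf(groupKey(v));
    if (idx !== -1) {
        p.done.splice(idx, 1);                        // remove the key rather than pushing it again
        p.mach_count -= v.machineCount;
    }
    p.fail_count -= v.failCount;
    return p;
}

function initial_rate() {
    return { mach_count: 0, fail_count: 0, done: [] };
}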
If you want to put together a working example, I or someone else will be happy to get it going for you, I'm sure.
You may also want to try Reductio. Crossfilter reducers are difficult to do right and efficiently, so it may make sense to use a library to help. With Reductio, creating a group that calculates your machine count and failure count looks like this:
var reducer = reductio();
reducer.value('mach_count')
    .exception(function(d) { return d.room; })
    .exceptionSum(function(d) { return d.machineCount; });
reducer.value('fail_count')
    .sum(function(d) { return d.failCount; });

var dim = ndx.dimension(...);
var grp = dim.group();
reducer(grp);

Spark: How to create a sessionId based on userId and timestamp

Sorry for a newbie question.
Currently I have log files which contain fields such as userId, event, and timestamp, but lack a sessionId. My aim is to create a sessionId for each record based on the timestamp and a pre-defined value TIMEOUT.
If the TIMEOUT value is 10 and the sample DataFrame is:
scala> eventSequence.show(false)
+----------+------------+----------+
|userId    |event       |timestamp |
+----------+------------+----------+
|U1        |A           |1         |
|U2        |B           |2         |
|U1        |C           |5         |
|U3        |A           |8         |
|U1        |D           |20        |
|U2        |B           |23        |
+----------+------------+----------+
The goal is:
+----------+------------+----------+----------+
|userId    |event       |timestamp |sessionId |
+----------+------------+----------+----------+
|U1        |A           |1         |S1        |
|U2        |B           |2         |S2        |
|U1        |C           |5         |S1        |
|U3        |A           |8         |S3        |
|U1        |D           |20        |S4        |
|U2        |B           |23        |S5        |
+----------+------------+----------+----------+
I find one solution in R (Create a "sessionID" based on "userID" and differences in "timeStamp"), while I am not able to figure it out in Spark.
Thanks for any suggestions on this problem.
Shawn's answer is about "How to create a new column", while my aim is "How to create a sessionId column based on timestamp". After days of struggling, I found that a Window function gives a simple solution in this scenario.
Window functions were introduced in Spark 1.4; they provide functions for operations that:
operate on a group of rows while still returning a single value for every input row
In order to create a sessionId based on the timestamp, I first need the time difference between a user's consecutive operations. windowDef defines a Window partitioned by "userId" and ordered by timestamp; diff is then a column that, for each row, returns the timestamp of the row 1 position after the current row in its partition (group), or null if the current row is the last row in the partition.
def handleDiff(timeOut: Int) = {
  udf { (timeDiff: Int, timestamp: Int) => if (timeDiff > timeOut) timestamp + ";" else timestamp + "" }
}

val windowDef = Window.partitionBy("userId").orderBy("timestamp")
val diff: Column = lead(eventSequence("timestamp"), 1).over(windowDef)

val dfTSDiff = eventSequence.
  withColumn("time_diff", diff - eventSequence("timestamp")).
  withColumn("event_seq", handleDiff(TIME_OUT)(col("time_diff"), col("timestamp"))).
  groupBy("userId").agg(GroupConcat(col("event_seq")).alias("event_seqs"))
Updated:
Then exploit the Window function to apply the "cumsum"-like operation (provided in Pandas):
// Define a Window, partitioned by userId (partitionBy), ordered by timestamp (orderBy),
// whose frame is every row up to and including the current row in the partition (rowsBetween)
val windowSpec = Window.partitionBy("userId").orderBy("timestamp").rowsBetween(Long.MinValue, 0)

val sessionDf = dfTSDiff.
  withColumn("ts_diff_flag", genTSFlag(TIME_OUT)(col("time_diff"))).
  select(col("userId"), col("eventSeq"), col("timestamp"), sum("ts_diff_flag").over(windowSpec).alias("sessionInteger")).
  withColumn("sessionId", genSessionId(col("userId"), col("sessionInteger")))
Previously:
Then split by ";" to get each session and create a sessionId; afterwards split by "," and explode to the final result. Thus the sessionId is created with the help of string operations.
(This part should be replaced by cumulative sum operation instead, however I did not find a good solution)
Any idea or thought about this question is welcomed.
GroupConcat could be found here: SPARK SQL replacement for mysql GROUP_CONCAT aggregate function
Reference: databricks introduction
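For completeness, here is a rough, self-contained sketch of the lag-plus-cumulative-sum sessionization described above, written against the sample data. The column names, the per-user session numbering, and the U1_1-style session ids are my own choices, not the answer's exact code:
// assumes a spark-shell, or an existing SparkSession named `spark`
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val TIME_OUT = 10
val eventSequence = Seq(
  ("U1", "A", 1), ("U2", "B", 2), ("U1", "C", 5),
  ("U3", "A", 8), ("U1", "D", 20), ("U2", "B", 23)
).toDF("userId", "event", "timestamp")

val w = Window.partitionBy("userId").orderBy("timestamp")

val sessions = eventSequence
  // 1 marks the first event of a new session: no previous event, or a gap above the timeout
  .withColumn("newSession",
    when(lag("timestamp", 1).over(w).isNull ||
         col("timestamp") - lag("timestamp", 1).over(w) > TIME_OUT, 1).otherwise(0))
  // running per-user sum of the flags: the "cumsum"-like step
  .withColumn("sessionNum", sum("newSession").over(w.rowsBetween(Long.MinValue, 0)))
  .withColumn("sessionId", concat(col("userId"), lit("_"), col("sessionNum")))

sessions.orderBy("timestamp").show(false)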
dt.withColumn("sessionId", <expression for the new column sessionId>)
for example:
dt("timestamp") + <the pre-defined value TIMEOUT>

Implementing tables in lua to access specific pieces for later use

I am trying to make a table store 3 parts which will each be huge in length. The first is the name, the second is EID, the third is SID. I want to be able to get the information like this: name[1] gives me the first name in the list of names, and likewise for the other two. I'm running into problems with how to do this because it seems like everyone has their own way, and they are all very different from one another. Right now this is what I have:
info = {
    {name = "btest", EID = "19867", SID = "664"},
    {name = "btest1", EID = "19867", SID = "664"},
    {name = "btest2", EID = "19867", SID = "664"},
    {name = "btest3", EID = "19867", SID = "664"},
}
Theoretically speaking, would I be able to just say info.name[1]? Or how else would I be able to arrange the table so I can access each part separately?
There are two main "ways" of storing the data:
Horizontal partitioning (Object-oriented)
Store each row of the data in a table. All tables must have the same fields.
Advantages: Each table contains related data, so it's easier to pass it around (e.g., f(info[5])).
Disadvantages: A table has to be created for each element, adding some overhead.
This looks exactly like your example:
info = {
    {name = "btest", EID = "19867", SID = "664"},
    -- etc ...
}

print(info[2].name) -- access second name
Vertical partitioning (Array-oriented)
Store each property in a table. All tables must have the same length.
Advantages: Fewer tables overall, and slightly more time- and space-efficient (the Lua VM uses actual arrays).
Disadvantages: Needs two objects to refer to a row: the table and the index. It's harder to insert/delete.
Your example would look like this:
info = {
    names = { "btest", "btest1", "btest2", "btest3" },
    EID = { "19867", "19867", "19867", "19867" },
    SID = { "664", "664", "664", "664" },
}
print(info.names[2]) -- access second name
So which one should I choose?
Unless you really need performance, you should go with horizontal partitioning. It's far more common to work over full rows, and it gives you more freedom in how you use your structures. If you decide to go full OO, having your data in horizontal form will be much easier.
Addendum
The names "horizontal" and "vertical" come from the table representation of a relational database.
    | names | EID | SID |            | names |
  --+-------+-----+-----+          --+-------+
  1 |       |     |     |          1 |       |
  2 |       |     |     |          2 |       |
  3 |       |     |     |          3 |       |
  --+-------+-----+-----+          --+-------+
     horizontal: full rows            vertical: full columns
Your info table is an array, so you can access items using info[N] where N is any number from 1 to the number of items in the table. Each field of the info table is itself a table. The 2nd item of info is info[2], so the name field of that item is info[2].name.
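If you do want the info.name[1] style of access from the question while keeping the row-oriented layout, a small conversion helper can bridge the two. A sketch (to_columns and info_cols are names I've made up):
local function to_columns(rows)
    local cols = { name = {}, EID = {}, SID = {} }
    for i, row in ipairs(rows) do
        cols.name[i] = row.name
        cols.EID[i] = row.EID
        cols.SID[i] = row.SID
    end
    return cols
end

local info_cols = to_columns(info)
print(info_cols.name[1]) --> btest
print(info[1].name)      --> btest (row-oriented access, as above)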
