Issue with KQL / Kusto distinct, extend and project - distinct

I don't really understand what the issue is when you use distinct or a distinct count AFTER a project or extend operator.
let planting_table = ['events.all']
| where FullName_Name == "plant_seed"
| extend SeedName = EventData.Payload.SeedName,
OasisName = EventData.Payload.OasisName,
TileName = EventData.Payload.TileName
| project-away SchemaVersion, FullName_Namespace, Entity_Id, Entity_Type,
EntityLineage_title, EventData, EntityLineage_title_player_account,
EntityLineage_namespace, ExperimentVariants, FullName_Name
| project-rename id = EntityLineage_master_player_account
;
planting_table
| summarize dcount(SeedName) by id
My goal is to get a distinct count of SeedName by id in Kusto / KQL. How can I do that?
Why can't I use distinct after an extend or a project operator?
Thanks for the help!

Properties read from a dynamic column are themselves of type dynamic. Try casting the dynamic property named SeedName to string, using the tostring() function, so that you can aggregate over it with dcount() or use it with the distinct operator.
i.e.
...
| extend SeedName = tostring(EventData.Payload.SeedName),
...
| summarize dcount(SeedName) by id
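For readers less familiar with Kusto, here is a minimal Python sketch of what the summarize step computes. The sample rows and the function name are hypothetical, and the count here is exact, whereas Kusto's dcount() returns an estimate.

```python
from collections import defaultdict

# Hypothetical rows standing in for the planting_table result.
rows = [
    {"id": "p1", "SeedName": "corn"},
    {"id": "p1", "SeedName": "corn"},
    {"id": "p1", "SeedName": "wheat"},
    {"id": "p2", "SeedName": "rice"},
]


def distinct_seed_count_by_id(rows):
    """Count distinct SeedName values per id."""
    seen = defaultdict(set)
    for row in rows:
        # str() plays the role of the tostring() cast: values are compared as strings.
        seen[row["id"]].add(str(row["SeedName"]))
    return {key: len(values) for key, values in seen.items()}


counts = distinct_seed_count_by_id(rows)
print(counts)
```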


How do i strip rows inside a Column-Table, based on the "outer" tables value in Power Query?

Being pretty new to Power Query, I am facing a problem I would like to solve.
I have a TableA with these columns. Example:
Key | Sprint | Index
-------------------------
A | PI1-I1 | 1
A | PI1-I2 | 2
B | PI1-I3 | 1
C | PI1-I1 | 1
I want to end up with a set looking like this:
Key | Sprint | Index | HasSpillOver
-------------------------
A | PI1-I1 | 1 | Yes
A | PI1-I2 | 2 | No
B | PI1-I3 | 1 | No
C | PI1-I1 | 1 | No
I thought I could maybe nested-join TableA on itself, compare the indices, strip away the rows that don't qualify, and then count the rows in the nested table, as outlined below.
TableA=Key, Sprint, Index
// TableA Nested joined on itself (Key, Sprint, Index, Nested)
TableB=NestedJoin(#"TableA", "Key", #"TableA", "Key", "Nested", JoinKind.Inner)
TableC= Table.TransformColumns(#"TableB", {"Nested", (x)=>Table.SelectRows(x, each [Index] <x[Index])} )
.. and then do the count, however this throws an error:
Can not apply operator < on types List and Number.
Any suggestions how to approach this problem? Possibly (probably) in a different way.
You did not define very well what "spillover" means, but this should get you most of the way.
My version assumes adding another index; you could use the one you have if it is relevant.
The code then counts the number of rows where the (second) index is higher and the [Key] field matches. You could add a condition so that the Sprint field matches as well, if relevant.
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Index" = Table.AddIndexColumn(Source, "Index.1", 0, 1),
#"Added Custom" = Table.AddColumn(#"Added Index" ,"Count",(i)=>Table.RowCount(Table.SelectRows(#"Added Index" , each [Key]=i[Key] and [Index.1]>i[Index.1])))
in #"Added Custom"
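The counting logic above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and mapping a non-zero count to "Yes" is an assumption about what "spillover" means, chosen because it matches the expected output in the question.

```python
rows = [
    {"Key": "A", "Sprint": "PI1-I1", "Index": 1},
    {"Key": "A", "Sprint": "PI1-I2", "Index": 2},
    {"Key": "B", "Sprint": "PI1-I3", "Index": 1},
    {"Key": "C", "Sprint": "PI1-I1", "Index": 1},
]


def add_later_row_count(rows):
    """For each row, count the rows further down the table that share its Key."""
    result = []
    for position, row in enumerate(rows):
        # The list position plays the role of the added "Index.1" column.
        count = sum(1 for later in rows[position + 1:] if later["Key"] == row["Key"])
        result.append({**row, "Count": count})
    return result


counted = add_later_row_count(rows)
# Assumption: a row "spills over" when a later row exists for the same Key.
has_spill_over = ["Yes" if row["Count"] > 0 else "No" for row in counted]
print(has_spill_over)
```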

fill time gaps with power query

I have following data
start stop status
+-----------+-----------+-----------+
| 09:01:10 | 09:01:40 | active |
| 09:02:30 | 09:04:50 | active |
| 09:10:01 | 09:11:50 | active |
+-----------+-----------+-----------+
I want to fill in the gaps with "passive"
start stop status
+-----------+-----------+-----------+
| 09:01:10 | 09:01:40 | active |
| 09:01:40 | 09:02:30 | passive |
| 09:02:30 | 09:04:50 | active |
| 09:04:50 | 09:10:01 | passive |
| 09:10:01 | 09:11:50 | active |
+-----------+-----------+-----------+
How can I do this in M Query language?
You could try something like the below (my first two steps someTable and changedTypes are just to re-create your sample data on my end):
let
someTable = Table.FromColumns({{"09:01:10", "09:02:30", "09:10:01"}, {"09:01:40", "09:04:50", "09:11:50"}, {"active", "active", "active"}}, {"start","stop","status"}),
changedTypes = Table.TransformColumnTypes(someTable, {{"start", type duration}, {"stop", type duration}, {"status", type text}}),
listOfRecords = Table.ToRecords(changedTypes),
transformList = List.Accumulate(List.Skip(List.Positions(listOfRecords)), {listOfRecords{0}}, (listState, currentIndex) =>
let
previousRecord = listOfRecords{currentIndex-1},
currentRecord = listOfRecords{currentIndex},
thereIsAGap = currentRecord[start] <> previousRecord[stop],
recordsToAdd = if thereIsAGap then {[start=previousRecord[stop], stop=currentRecord[start], status="passive"], currentRecord} else {currentRecord},
append = listState & recordsToAdd
in
append
),
backToTable = Table.FromRecords(transformList, type table [start=duration, stop=duration, status=text])
in
backToTable
To integrate with your existing M code, you'll probably need to:
remove someTable and changedTypes from my code (and replace with your existing query)
change changedTypes in the listOfRecords step to whatever your last step is called (otherwise you'll get an error if you don't have a changedTypes expression in your code).
Edit:
Further to my answer, what I would suggest is:
Try changing this line in the code above:
listOfRecords = Table.ToRecords(changedTypes),
to
listOfRecords = List.Buffer(Table.ToRecords(changedTypes)),
I found that storing the list in memory reduced my refresh time significantly (maybe ~90% if quantified). I imagine there are limits and drawbacks (e.g. if the list can't fit), but might be okay for your use case.
Do you experience similar behaviour? Also, my basic graph indicates non-linear complexity of the code overall unfortunately.
Final note: I found that generating and processing 100k rows resulted in a stack overflow whilst refreshing the query (this might have been due to the generation of input rows and not the insertion of new rows; I don't know). So clearly, this approach has limits.
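For comparison, the accumulate-style approach above could look like this in Python. It is a sketch under the assumption that the rows are already sorted by start; the names append_with_gap and fill_gaps are made up for illustration. Appending to the running list in place avoids copying it on every step.

```python
from datetime import time
from functools import reduce

rows = [
    (time(9, 1, 10), time(9, 1, 40), "active"),
    (time(9, 2, 30), time(9, 4, 50), "active"),
    (time(9, 10, 1), time(9, 11, 50), "active"),
]


def append_with_gap(state, current):
    """Append `current`, preceded by a passive row if it does not start where the last row stopped."""
    previous = state[-1]
    if current[0] != previous[1]:
        state.append((previous[1], current[0], "passive"))
    state.append(current)
    return state


def fill_gaps(rows):
    """Fold over the sorted rows, seeding the state with the first row."""
    if not rows:
        return []
    return reduce(append_with_gap, rows[1:], [rows[0]])


filled = fill_gaps(rows)
for row in filled:
    print(row)
```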
I think I may have a better performing solution.
From your source table (assuming it's sorted), add an index column starting from 0 and an index column starting from 1 and then merge the table with itself doing a left outer join on the index columns and expand the start column.
Remove columns except for stop, status, and start.1 and filter out nulls.
Rename columns to start, status, and stop and replace "active" with "passive".
Finally, append this table to your original table.
let
Source = Table.RenameColumns(#"Removed Columns",{{"Column1.2", "start"}, {"Column1.3", "stop"}, {"Column1.4", "status"}}),
Add1Index = Table.AddIndexColumn(Source, "Index", 1, 1),
Add0Index = Table.AddIndexColumn(Add1Index, "Index.1", 0, 1),
SelfMerge = Table.NestedJoin(Add0Index,{"Index"},Add0Index,{"Index.1"},"Added Index1",JoinKind.LeftOuter),
ExpandStart1 = Table.ExpandTableColumn(SelfMerge, "Added Index1", {"start"}, {"start.1"}),
RemoveCols = Table.RemoveColumns(ExpandStart1,{"start", "Index", "Index.1"}),
FilterNulls = Table.SelectRows(RemoveCols, each ([start.1] <> null)),
RenameCols = Table.RenameColumns(FilterNulls,{{"stop", "start"}, {"start.1", "stop"}}),
ActiveToPassive = Table.ReplaceValue(RenameCols,"active","passive",Replacer.ReplaceText,{"status"}),
AppendQuery = Table.Combine({Source, ActiveToPassive}),
#"Sorted Rows" = Table.Sort(AppendQuery,{{"start", Order.Ascending}})
in
#"Sorted Rows"
This should be O(n) complexity with similar logic to @chillin's answer, but I think it should be faster than using a custom function, since it uses a built-in merge which is likely to be highly optimized.
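The offset-index self-merge amounts to pairing every row with the row after it. A rough Python equivalent, assuming sorted input and using a hypothetical function name, pairs the list with a shifted copy of itself. Like the M code, it emits a passive row for every adjacent pair without checking whether a gap actually exists.

```python
from datetime import time

rows = [
    (time(9, 1, 10), time(9, 1, 40), "active"),
    (time(9, 2, 30), time(9, 4, 50), "active"),
    (time(9, 10, 1), time(9, 11, 50), "active"),
]


def passive_rows_by_shift(rows):
    """Pair row i with row i + 1, like the join of Index (from 1) on Index.1 (from 0)."""
    return [
        (current[1], following[0], "passive")
        for current, following in zip(rows, rows[1:])
    ]


# Append the passive rows to the originals and sort by start, as in the final two steps.
combined = sorted(rows + passive_rows_by_shift(rows), key=lambda row: row[0])
for row in combined:
    print(row)
```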
I would approach this as follows:
Duplicate the first table.
Replace "active" with "passive".
Remove the start column.
Rename stop to start.
Create a new stop column by looking up the earliest start time from your original table that occurs after the current stop time.
Filter out nulls in this new column.
Append this table to the original table.
The M code will look something like this:
let
Source = <...your starting table...>,
PassiveStatus = Table.ReplaceValue(Source,"active","passive",Replacer.ReplaceText,{"status"}),
RemoveStart = Table.RemoveColumns(PassiveStatus,{"start"}),
RenameStart = Table.RenameColumns(RemoveStart,{{"stop", "start"}}),
AddStop = Table.AddColumn(RenameStart, "stop", (C) => List.Min(List.Select(Source[start], each _ > C[start])), type time),
RemoveNulls = Table.SelectRows(AddStop, each ([stop] <> null)),
CombineTables = Table.Combine({Source, RemoveNulls}),
#"Sorted Rows" = Table.Sort(CombineTables,{{"start", Order.Ascending}})
in
#"Sorted Rows"
The only tricky bit above is the custom column part where I define the new column like this:
(C) => List.Min(List.Select(Source[start], each _ > C[start]))
This takes each item in the column/list Source[start] and compares it to the time in the current row. It selects only the ones that occur after the time in the current row and then takes the minimum over that list to find the earliest one.
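The same lookup can be written in Python as a sketch (passive_rows is a hypothetical name). Each stop time is compared against every start time, so the work grows quadratically with the number of rows.

```python
from datetime import time

rows = [
    (time(9, 1, 10), time(9, 1, 40), "active"),
    (time(9, 2, 30), time(9, 4, 50), "active"),
    (time(9, 10, 1), time(9, 11, 50), "active"),
]


def passive_rows(rows):
    """For each stop time, look up the earliest start time that occurs after it."""
    starts = [start for start, _stop, _status in rows]
    gaps = []
    for _start, stop, _status in rows:
        # default=None mirrors the null produced for the last row.
        next_start = min((s for s in starts if s > stop), default=None)
        if next_start is not None:
            gaps.append((stop, next_start, "passive"))
    return gaps


combined = sorted(rows + passive_rows(rows), key=lambda row: row[0])
for row in combined:
    print(row)
```

Note that the comparison is strict: if one row's stop equals the next row's start, the lookup skips ahead to a later start, so rows that touch exactly may need separate handling.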

Extracting a specific number from a string using regex function in Spark SQL

I have a table in MySQL which has POST_ID and the corresponding INTEREST:
I used the following regular expression query to select interests containing 1, 2 and 3.
SELECT * FROM INTEREST_POST where INTEREST REGEXP '(?=.*[[:<:]]1[[:>:]])(?=.*[[:<:]]3[[:>:]])(?=.*[[:<:]]2[[:>:]])';
I imported the table into HDFS. However, when I use the same query in Spark SQL, it returns no records.
How can I use the REGEXP function in Spark to select interests containing 1, 2 and 3?
The regex you are using needs to be changed a bit. You could do something like the following.
scala> val myDf2 = spark.sql("SELECT * FROM INTEREST_POST where INTEREST REGEXP '^[1-3](,[1-3])*$'")
myDf2: org.apache.spark.sql.DataFrame = [INTEREST_POST_ID: int, USER_POST_ID: int ... 1 more field]
scala> myDf2.show
+----------------+------------+--------+
|INTEREST_POST_ID|USER_POST_ID|INTEREST|
+----------------+------------+--------+
| 1| 1| 1,2,3|
I got the solution. You can do something like this:
var result = hiveContext.sql("""SELECT USER_POST_ID
| FROM INTEREST_POST_TABLE
| WHERE INTEREST REGEXP '(?=.*0[1])(?=.*0[2])(?=.*0[3])' """.stripMargin)
result.show
Fetching Records from INTEREST_POST_TABLE
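Spark SQL's REGEXP uses Java regular expressions, where the MySQL word-boundary markers [[:<:]] and [[:>:]] do not have the same meaning; the usual replacement is \b. The sketch below uses Python's re module, which treats \b and lookaheads the same way for this pattern, to show the lookahead idea on made-up sample values. In a Spark SQL string literal the backslashes may need to be doubled, depending on configuration.

```python
import re

# One lookahead per required value; \b keeps "1" from matching inside "11" or "13".
CONTAINS_1_2_3 = re.compile(r"(?=.*\b1\b)(?=.*\b2\b)(?=.*\b3\b)")

# Hypothetical INTEREST values, stored as comma-separated numbers.
interests = ["1,2,3", "3,1,2,7", "1,2", "11,2,3", "2,13,1"]

matching = [value for value in interests if CONTAINS_1_2_3.search(value)]
print(matching)
```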

Kotlin - Sort List by using formatted date string (functional)

I'm trying to create a Kotlin REST API which retrieves values from a PostgreSQL database. The values in these results are, for example, "14-10-2016 | 15:48" and "01-08-2015 | 09:29", so the format is basically dd-MM-yyyy | hh:mm.
Now what I'm trying to do is create a function that will order them by date placed. (Assume these strings are in an array)
var list = listOf("14-10-2016 | 15:48",
"01-08-2015 | 09:29",
"15-11-2016 | 19:43")
What would be the cleanest (and most functional) way of sorting these? For example, is there a way where I don't have to take substrings of the day, month, etc., cast them to Int, compare them in a nested loop and write the results to a different array? (That's the only way I could think of.)
More than one approach can be used. It depends on how you process after you get the sorted result.
Points to note:
java.time.LocalDateTime already implements the java.lang.Comparable<T> interface, so we can use the Kotlin stdlib sortedBy to sort by LocalDateTime directly.
Ref:
// https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/sorted-by.html
fun <T, R : Comparable<R>> Iterable<T>.sortedBy(
selector: (T) -> R?
): List<T>
The easiest way is to transform String -> java.time.LocalDateTime and use sortedBy directly.
The whole implementation could be like this:
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
...
// Create a convert function, String -> LocalDateTime
val dateTimeStrToLocalDateTime: (String) -> LocalDateTime = {
LocalDateTime.parse(it, DateTimeFormatter.ofPattern("dd-MM-yyyy | HH:mm"))
}
val list = listOf("14-10-2016 | 15:48",
"01-08-2015 | 09:29",
"15-11-2016 | 19:43")
// You will get List<LocalDateTime> sorted in ascending order
list.map(dateTimeStrToLocalDateTime).sorted()
// You will get List<LocalDateTime> sorted in descending order
list.map(dateTimeStrToLocalDateTime).sortedDescending()
// You will get List<String> which is sorted in ascending order
list.sortedBy(dateTimeStrToLocalDateTime)
// You will get List<String> which is sorted in descending order
list.sortedByDescending(dateTimeStrToLocalDateTime)
If you want to use org.joda.time.DateTime, you can just make a tiny change on the convert function.
A friendly reminder, always pick val as your first choice in Kotlin :).
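The idea of sorting the original strings by a parsed key is not specific to Kotlin. As a language-neutral illustration, here is the same approach sketched in Python, where the format string is the strptime equivalent of "dd-MM-yyyy | HH:mm" and the helper name parse is arbitrary.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y | %H:%M"

dates = ["14-10-2016 | 15:48", "01-08-2015 | 09:29", "15-11-2016 | 19:43"]


def parse(text):
    """Convert a formatted date string into a comparable datetime."""
    return datetime.strptime(text, DATE_FORMAT)


# Sort the strings themselves, using the parsed value only as the sort key.
ascending = sorted(dates, key=parse)
descending = sorted(dates, key=parse, reverse=True)

# Or convert first and sort the datetime values.
parsed_ascending = sorted(map(parse, dates))

print(ascending)
```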
Another alternative to the excellent Saravana answer (for minimalist and compact freaks like me..) is:
val cmp = compareBy<String> { LocalDateTime.parse(it, DateTimeFormatter.ofPattern("dd-MM-yyyy | HH:mm")) }
list.sortedWith(cmp).forEach(::println)
01-08-2015 | 09:29
14-10-2016 | 15:48
15-11-2016 | 19:43
P.S.: it is the implicit name of a lambda's single parameter.
You could use DateTimeFormatter to parse and then compare with LocalDateTime
List<String> dates = Arrays.asList("14-10-2016 | 15:48", "01-08-2015 | 09:29", "15-11-2016 | 19:43");
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MM-yyyy | HH:mm");
List<LocalDateTime> dateTimes = dates.stream().map(date -> LocalDateTime.parse(date, formatter)).sorted().collect(Collectors.toList());
System.out.println(dateTimes);
Output
[2015-08-01T09:29, 2016-10-14T15:48, 2016-11-15T19:43]
Update
You could simply convert to LocalDateTime in Comparator alone
List<String> sortedDates = dates.stream().sorted(Comparator.comparing(date -> LocalDateTime.parse(date, formatter))).collect(Collectors.toList());
output
[01-08-2015 | 09:29, 14-10-2016 | 15:48, 15-11-2016 | 19:43]
If you have a mutable list of custom objects, you can sort it in place as below.
println("--- ASC ---")
dates.sortBy { it.year }
println("--- DESC ---")
dates.sortByDescending { it.year }
You can use sortByDescending {it.field} for descending order.

Implementing tables in lua to access specific pieces for later use

I am trying to make a table that stores three parts, each of which will be very long. The first is the name, the second is the EID, and the third is the SID. I want to be able to get the information like this: name[1] gives me the first name in the list of names, and likewise for the other two. I'm running into problems with how to do this because everyone seems to have their own way, and they are all very different from one another. Right now this is what I have:
info = {
{name = "btest", EID = "19867", SID = "664"},
{name = "btest1", EID = "19867", SID = "664"},
{name = "btest2", EID = "19867", SID = "664"},
{name = "btest3", EID = "19867", SID = "664"},
}
Theoretically speaking, would I be able to just say info.name[1]? Or how else could I arrange the table so that I can access each part separately?
There are two main "ways" of storing the data:
Horizontal partitioning (Object-oriented)
Store each row of the data in a table. All tables must have the same fields.
Advantages: Each table contains related data, so it's easier passing it around (e.g, f(info[5])).
Disadvantages: A table is to be created for each element, adding some overhead.
This looks exactly like your example:
info = {
{name = "btest", EID = "19867", SID = "664"},
-- etc ...
}
print(info[2].name) -- access second name
Vertical partitioning (Array-oriented)
Store each property in a table. All tables must have the same length.
Advantages: Fewer tables overall, and slightly more time and space efficient (Lua VM uses actual arrays).
Disadvantages: Needs two objects to refer to a row: the table and the index. It's harder to insert/delete.
Your example would look like this:
info = {
names = { "btest", "btest1", "btest2", "btest3", },
EID = { "19867", "19867", "19867", "19867", },
SID = { "664", "664", "664", "664", },
}
print(info.names[2]) -- access second name
So which one should I choose?
Unless you really need the performance, you should go with horizontal partitioning. Working over full rows is far more common, and it gives you more freedom in how you use your structures. If you decide to go full OO, having your data in horizontal form will be much easier.
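The two layouts are easy to convert between, which may help when deciding. The following Python sketch shows both shapes and a conversion in each direction. The helper names are hypothetical, and Python lists are 0-based where Lua arrays are 1-based, so the "second name" is at index 1 here.

```python
# Row-oriented ("horizontal"): one record per element.
info_rows = [
    {"name": "btest", "EID": "19867", "SID": "664"},
    {"name": "btest1", "EID": "19867", "SID": "664"},
    {"name": "btest2", "EID": "19867", "SID": "664"},
]


def rows_to_columns(rows):
    """Turn a list of records into one list per field ("vertical" layout)."""
    fields = rows[0].keys() if rows else []
    return {field: [row[field] for row in rows] for field in fields}


def columns_to_rows(columns):
    """Turn one list per field back into a list of records."""
    fields = list(columns)
    return [dict(zip(fields, values)) for values in zip(*columns.values())]


info_columns = rows_to_columns(info_rows)

print(info_rows[1]["name"])     # second name, row-oriented access
print(info_columns["name"][1])  # second name, column-oriented access
```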
Addendum
The names "horizontal" and "vertical" come from the table representation of a relational database.
  | names | EID | SID |     | names |
--+-------+-----+-----+     +-------+
1 |       |     |     |     |       |     --+-------+-----+-----+
2 |       |     |     |     |       |     2 |       |     |     |
3 |       |     |     |     |       |     --+-------+-----+-----+
Your info table is an array, so you can access items using info[N] where N is any number from 1 to the number of items in the table. Each field of the info table is itself a table. The 2nd item of info is info[2], so the name field of that item is info[2].name.
