Gremlin Sack Sum Once Per Distinct Value - filter

Referencing this example from Practical Gremlin and this Stack Overflow post (Gremlin Post Filter Path):
g.withSack(0).V().
has('code','AUS').
repeat(out().simplePath().has('country',within('US','UK')).
choose(has('code','MAN'),sack(sum).by(constant(1)))).
until(has('code','EDI')).
where(sack().is(1)).
path().by('code').
limit(10)
Is it possible to perform a sack sum so that it only increments the first time a given property value is found? In the query above, the choose() tests the 'code' property, and thanks to simplePath() the sum can only increment once per 'code' encountered. Suppose instead there were another property called 'airport_color'. As the traversal proceeds, I would only want the sack sum to increment the first time it encounters 'blue' or the first time it encounters 'white', even though multiple airports could share the same color. This would help in the where() clause: if I were looking for a couple of colors (say blue and white), I could require the sack to equal two and know that two was reached because both blue and white were encountered, not because I passed through blue twice.
I tried using aggregation to make the sack sum increment only on the first encounter but couldn't get it to work, something like this:
g.withSack(0).V().
has('code','AUS').
repeat(out().simplePath().has('country',within('US','UK')).
choose(has('airport_color','blue').has('airport_color', without('airport_color_agg')),sack(sum).by(constant(1))).
aggregate('airport_color_agg').by('airport_color')).
until(has('code','EDI')).
where(sack().is(1)).
path().by('code').
limit(10)
There could be multiple colors in the choose() via or() but I limited it to just one to keep the example more straightforward.
Thanks for your help!

Using the sample graph below:
g.addV('A').property(id,'a1').property('color','red').as('a1').
addV('A').property(id,'a2').property('color','blue').as('a2').
addV('A').property(id,'a3').property('color','red').as('a3').
addV('A').property(id,'a4').property('color','yellow').as('a4').
addV('A').property(id,'a5').property('color','green').as('a5').
addV('A').property(id,'a6').property('color','blue').as('a6').
addE('R').from('a1').to('a2').
addE('R').from('a2').to('a3').
addE('R').from('a3').to('a4').
addE('R').from('a4').to('a5').
addE('R').from('a5').to('a6')
we can inspect the path through the graph as follows:
g.V('a1').
repeat(out()).
until(not(out())).
path().
by('color')
which shows us the colors found
path[red, blue, red, yellow, green, blue]
the next thing we need to do is to remove the duplicates and filter out the colors not in our want list.
g.withSideEffect('want',['red','green']).
V('a1').
repeat(out()).
until(not(out())).
path().
by('color').
dedup(local).
unfold().
where(within('want'))
which gives us:
red
green
finally, we just need to count them:
g.withSideEffect('want',['red','green']).
V('a1').
repeat(out()).
until(not(out())).
path().
by('color').
dedup(local).
unfold().
where(within('want')).
count()
Which, as expected, gives us:
2
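To check the logic outside of Gremlin, the dedup, filter, and count steps above can be mirrored in plain Python. This is only an illustrative sketch; the function name count_wanted is made up here and is not part of any Gremlin API.

```python
def count_wanted(path_colors, want):
    """Count how many distinct wanted colors appear in one path."""
    seen = []
    for color in path_colors:
        # Mirrors dedup(local): keep only the first occurrence of each color
        if color not in seen:
            seen.append(color)
    # Mirrors unfold().where(within('want')).count()
    return sum(1 for color in seen if color in want)


# The colors from the sample path shown earlier
path_colors = ["red", "blue", "red", "yellow", "green", "blue"]
print(count_wanted(path_colors, ["red", "green"]))
```

Passing through the same color twice does not change the result, which is the property the question asks for.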
UPDATED 2022-09-01 To reflect discussion in comments.
To change the query so that only paths that visit each of the required colors at least once are returned, the previous steps leading up to the count need to be turned into a filter.
g.withSideEffect('want',['red','green']).
V('a1').
repeat(out()).
until(not(out())).
filter(
path().
by('color').
dedup(local).
unfold().
where(within('want')).
count().is(2)).
path()
which for our sample graph returns:
path[v[a1], v[a2], v[a3], v[a4], v[a5], v[a6]]
The query as written gets the job done and is hopefully not too hard to follow. There are a few things we could change or improve:
Pass in the count for the is step as another parameter like the want list.
Rather than pass in a parameter, use the size of the want list instead. This makes the query a little more complex.
Use a sack as the query proceeds to collect the colors seen. The query gets more complex in that case as, while you can maintain a list in a sack, updating it requires a few extra steps.
Here is the query rewritten to use a sack. This assumes the start node has a color that should be included. If that is not the case, the first sack(assign) can be removed and a withSack([]) added after the withSideEffect.
g.withSideEffect('want',['red','green']).
V('a1').
sack(assign).by(values('color').fold()).
repeat(out().sack(assign).by(union(sack().unfold(),values('color')).fold())).
until(not(out())).
filter(
sack().
unfold().
dedup().
where(within('want')).
count().is(2)).
path()
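The sack-based idea can also be sketched in plain Python, assuming the sample graph is a simple chain. The dictionary layout and the function name walk_with_sack are hypothetical and exist only for illustration. The sketch compares against the size of the want list rather than a hard-coded 2, which is one of the improvements suggested above.

```python
# Hypothetical stand-in for the sample graph: vertex id -> (color, next vertex id or None)
graph = {
    "a1": ("red", "a2"),
    "a2": ("blue", "a3"),
    "a3": ("red", "a4"),
    "a4": ("yellow", "a5"),
    "a5": ("green", "a6"),
    "a6": ("blue", None),
}


def walk_with_sack(graph, start, want):
    """Follow out-edges from start, carrying a list of colors like the sack."""
    vertex = start
    path = [vertex]
    sack = [graph[vertex][0]]              # like the first sack(assign) on the start vertex
    while graph[vertex][1] is not None:    # like until(not(out()))
        vertex = graph[vertex][1]
        path.append(vertex)
        sack = sack + [graph[vertex][0]]   # like union(sack().unfold(), values('color')).fold()
    found = set(sack) & set(want)          # like unfold().dedup().where(within('want'))
    # Keep the path only if every wanted color was seen at least once
    return path if len(found) == len(set(want)) else None


print(walk_with_sack(graph, "a1", ["red", "green"]))
print(walk_with_sack(graph, "a1", ["red", "white"]))
```

The first call keeps the path because both wanted colors occur on it; the second returns None because no vertex is white.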


Reporting Multiple Values & Sorting

Having a bit of an issue and unsure if it's actually possible to do.
I'm working on a file in which I enter target progression versus the actual target and report the outcome as a percentage.
PAGE 1
¦NAME ¦TAR 1 %¦TAR 2 %¦TAR 3 %¦TAR 4 %¦OVERALL¦SUB 1¦SUB 2¦SUB 3¦
¦NAME1¦   114%¦   121%¦   100%¦   250%¦   146%¦    2¦    0¦   0%¦
¦NAME2¦    88%¦   100%¦    90%¦    50%¦    82%¦    0¦    1¦   0%¦
¦NAME3¦    82%¦    54%¦    64%¦   100%¦    75%¦    6¦    6¦  15%¦
¦NAME4¦   103%¦    64%¦    56%¦    43%¦    67%¦    4¦    4¦  24%¦
¦NAME5¦    87%¦    63%¦    89%¦     0%¦    60%¦    3¦    2¦  16%¦
I already have it sorting all rows by the Overall % column so I can see the ranking at a glance, but I am creating a second page that needs to reference specific results.
On the second page I would like to sort by and reference different columns, for example:
PAGE 2
TOP TAR 1¦Name of top %¦Top %¦
TOP TAR 2¦Name of top %¦Top %¦
Is something like this possible to do?
Essentially I'm creating an Employee of the Month form that automatically works out who has topped what.
I'm willing to drop a paypal donation for whoever can figure this out for me as I've been doing it manually every month and would appreciate the time saved
I don't think a complicated array formula is necessary for this - I am suggesting a fairly standard Index/Match approach.
First set up the row titles - you can just copy and transpose them from Page 1, or use a formula in A2 of Page 2 like
=transpose('Page 1'!B1:E1)
Then use them in an index/match to get the data in the corresponding column of the main sheet and find its maximum (in C2)
=max(index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)))
Finally look up the maximum in the main sheet to find the corresponding name:
=index('Page 1'!A:A,match(C2,index('Page 1'!A:E,0,match(A2,'Page 1'!A$1:E$1,0)),0))
If you think there could be a tie for first place, with two or more people getting the same score, you could use a filter to get the different names.
So if the max score is in B8 this time (same formula)
=max(index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0)))
the different names could be spread across the corresponding row using transpose (in C8)
=ArrayFormula(TRANSPOSE(filter('Page 1'!A:A,index('Page 1'!A:E,0,match(A8,'Page 1'!A$1:E$1,0))=B8)))
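For readers who want to sanity-check what the formulas compute, here is a rough Python equivalent of the max-then-filter logic, using the TAR columns from the Page 1 table in the question. The function name top_for and the dictionary layout are assumptions made for this sketch, not anything from the spreadsheet.

```python
# Scores copied from the Page 1 table in the question (TAR 1 % to TAR 4 %)
scores = {
    "NAME1": {"TAR 1 %": 114, "TAR 2 %": 121, "TAR 3 %": 100, "TAR 4 %": 250},
    "NAME2": {"TAR 1 %": 88, "TAR 2 %": 100, "TAR 3 %": 90, "TAR 4 %": 50},
    "NAME3": {"TAR 1 %": 82, "TAR 2 %": 54, "TAR 3 %": 64, "TAR 4 %": 100},
    "NAME4": {"TAR 1 %": 103, "TAR 2 %": 64, "TAR 3 %": 56, "TAR 4 %": 43},
    "NAME5": {"TAR 1 %": 87, "TAR 2 %": 63, "TAR 3 %": 89, "TAR 4 %": 0},
}


def top_for(scores, column):
    """Return the best score in a column and every name that achieved it."""
    best = max(row[column] for row in scores.values())       # like the max(index(...)) formula
    names = [name for name, row in scores.items() if row[column] == best]  # like filter(...)
    return best, names


for column in ["TAR 1 %", "TAR 2 %", "TAR 3 %", "TAR 4 %"]:
    print(column, top_for(scores, column))
```

Returning a list of names rather than a single name covers the tie case in the same way the filter formula does.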
I have changed the test data slightly to show these different scenarios

rowscap and filter applied in wrong order in DC.JS rowChart

Still using DC.JS to get some analysis tools written for our tool performance. Thanks so much for having this library available.
I am trying to show which recipe setup times are the worst for a given set of data. Everything works great as long as you show the whole group. When you only display the specified topN using .rowscap on the rowChart the following happens:
The chart will show the right number of bars and they are even sorted properly but the chart has picked the topN unfiltered bars first and then ordered them. I want it to pick the topN from the ordered list, not the other way around. See jsfiddle for demo. (http://jsfiddle.net/za8ksj45/24/)
In the fiddle, the longest setup time belongs to recipeD. But if you have more than two recipes selected before recipeD, it is dropped off the right (top2) chart.
line 099-110: reductio definition
line 120-140: removal of empty bins (works okay)
(This is very similar to a problem Gordon helped resolve earlier (dc.js rowChart topN without zeros), and I reused the code from that solution. Something went 'wrong' when I combined it with the reductio.js library.)
I think I am not returning the value portion of the reductio group somewhere but have been unable to figure it out. Any help would be appreciated.
The issue is that at the time you .slice(0,n) the group in your function to remove empty bins, the group is not ordered, so you effectively get a random 2 groups, not the top 2 groups. This is actually clear from the unfiltered view, as the "top2" view shows the 2nd and 3rd group from the "all" view, not the actual top 2 (at least for me).
The previous example worked because Crossfilter's standard groups are ordered by default, but in the case of a complex group like the one you are generating with Reductio, what should it order by? There's no way it can know, so Reductio doesn't mess with the ordering at all, which I suppose means it is ordering by the value property, which is an object.
You need to add one line to order your FactsByRecipe group by average and I think it should fix your problem:
FactsByRecipe.order(function(d) { return d.avg; });
Note that there can only be one ordering on a Crossfilter group, so if you want to show "top X" for more than one property of that group you'll need to create another wrapper (like the remove empty bins wrapper) but have the "top" function re-sort the group by the ordering you want.
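The difference between slicing an unordered group and slicing an ordered one can be shown with a small language-neutral sketch in Python. The recipe names follow the question, but the averages and both function names are invented for illustration; this is not dc.js or Crossfilter code.

```python
# Hypothetical average setup times, in the arbitrary order an unordered group might return them
bins = [
    {"key": "recipeA", "avg": 12.0},
    {"key": "recipeB", "avg": 30.0},
    {"key": "recipeC", "avg": 25.0},
    {"key": "recipeD", "avg": 48.0},
]


def slice_then_sort(bins, n):
    """What the question observes: take the first n bins, then order them."""
    return sorted(bins[:n], key=lambda d: d["avg"], reverse=True)


def sort_then_slice(bins, n):
    """What is wanted: order all bins by average, then take the top n."""
    return sorted(bins, key=lambda d: d["avg"], reverse=True)[:n]


print([d["key"] for d in slice_then_sort(bins, 2)])
print([d["key"] for d in sort_then_slice(bins, 2)])
```

With these made-up numbers the first approach never shows recipeD even though it has the largest average, which matches the symptom described in the question.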
Good luck!

Interpreting Cascading dot diagrams

Can someone explain how to read these diagrams? I understand the flow from head to tail, but I am specifically wondering about how to read the field (bracket) transitions between ellipses (Pipes/Taps).
By way of example using the Fields following the Every Pipe in the image, the way I have been able to interpret these is the first Field set i.e. [{2}:'token', 'count'] is what goes into the next Pipe/Tap, but what is the significance of the second Field set [{1}: 'token']?
Is this the field set that went into the previous Pipe above? Is there a programmatic significance to the second bracket i.e. are we able to access it within that pipe with particular Cascading code? (In the case where the second Fields set is greater than the first)
(source: cascading.org)
The second field set represents which fields are available for subsequent operations in that map or reduce.
In your example above, in the reduce step, since you grouped by 'token', only 'token' is available for subsequent aggregations (Everys) in that reduce step. You could, for example, add another aggregation which output the average token length, but you could not use an aggregation which utilized the 'count' yet.
The reason for this behaviour is that subsequent aggregations on the same group happen in parallel. Thus, the Count won't be completed to feed into any other aggregations you chained on.

Count nodes based on two or more attribute values

I need to count how many times a particular node occurs in a document based on the values of two of its attributes. So, given the following small sample of XML:
<p:entry timestamp="2012-11-15T17:53:34.642-05:00" ticks="89709622449012" system="OSD" component="OSD5" marker=".\Launcher.cpp:1741" severity="Info" type="Driver" subtype="Start" tags="" sensitivity="false">
This can occur one or more times in the document with different attribute sets. I need to count how many show up with type="Driver" AND subtype="Start". I am able to count how many just have type="Driver" using:
count(//p:entry[@type="Driver"])
but haven't been able to combine them. This didn't work:
count(//p:entry[@type="Driver" and @subtype="Start"])
This works for the OP. Specifying two predicates in succession, instead of using the and operator, has the same effect:
count(//p:entry[@type="Driver"][@subtype="Start"])
Strictly speaking, the original expression count(//p:entry[@type="Driver" and @subtype="Start"]) should also work, as far as my knowledge goes.
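As a quick way to try the two-predicate form, here is a minimal sketch using Python's standard xml.etree.ElementTree, which supports chained attribute predicates in its limited XPath subset. The namespace URI, the p:log wrapper element, and the sample entries are placeholders, since the question does not show the real document.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; the real one is not shown in the question
NS = {"p": "http://example.com/placeholder"}

# Made-up sample document with a hypothetical p:log root element
xml_text = """
<p:log xmlns:p="http://example.com/placeholder">
  <p:entry type="Driver" subtype="Start"/>
  <p:entry type="Driver" subtype="Stop"/>
  <p:entry type="Service" subtype="Start"/>
  <p:entry type="Driver" subtype="Start"/>
</p:log>
""".strip()

root = ET.fromstring(xml_text)

# Two predicates in succession: both attribute tests must hold for an entry to match
matches = root.findall(".//p:entry[@type='Driver'][@subtype='Start']", NS)
print(len(matches))
```

ElementTree has no count() function, so the sketch counts the matched elements with len() instead.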

Trouble with facet counts

I'm attempting to use ElasticSearch for analytics, specifically to track "top content" for a hand-rolled Rails CMS. The requirement is quite a bit more complicated than keeping a counter for each piece of content. I won't get into the depth of the problem right now, as I can't seem to get even the basics working.
My problem is this: I'm using facets and the counts aren't what I expect them to be. For example:
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":1,"all_terms":false,"order":"count"}}}}
Result:
{"el_ids":{"_type":"terms","missing":0,"total":16672,"other":16657,"terms":[{"term":"quis","count":15}]}}
Ok, great, the piece of content with id "quis" had 15 hits and since the order is count, it should be my top piece of content. Now let's get the top 5 pieces of content.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":5,"all_terms":false,"order":"count"}}}}
Result (just the facet):
[
{"term":"qgz9","count":26},
{"term":"quis","count":15},
{"term":"hnqn","count":15},
{"term":"higp","count":15},
{"term":"csns","count":15}
]
Huh? So the piece of content w/ id "qgz9" had more hits with 26? Why wasn't it the top result in the first query?
Ok, let's get the top 100 now.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":100,"all_terms":false,"order":"count"}}}}
Results (just the facet):
[
{"term":"qgz9","count":43},
{"term":"difc","count":37},
{"term":"zryp","count":31},
{"term":"u65r","count":31},
{"term":"sxsi","count":31},
...
]
So now "qgz9" has 43 hits instead of 26? How can that be? I can assure you there's nothing happening in the background modifying the index. If I repeat these queries, I get the same results.
As I repeat this process of increasing the result size, counts continue to change and new content ids emerge at the top. Can someone explain to me what I'm doing wrong or where my understanding of how this works is flawed?
It turns out that this is a known issue:
...the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results.
By default, my index was being created with 5 shards. After changing this so the index has only a single shard, the counts behave in line with my expectations. Another workaround would be to always set size to a value greater than the number of expected facet terms and peel off the top N results.
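The effect of merging per-shard top-N lists can be reproduced with a small simulation. This is a sketch of the general idea described in the quoted issue, not ElasticSearch's actual implementation; the shard contents and the function name merged_top are invented for illustration.

```python
from collections import Counter

# Hypothetical per-shard term counts: "a" is spread evenly, the other terms are concentrated
shards = [
    Counter({"b": 6, "a": 4}),
    Counter({"c": 5, "a": 4}),
    Counter({"d": 5, "a": 4}),
]


def merged_top(shards, size):
    """Ask each shard for its top `size` terms, sum what comes back, and return the top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(size):
            merged[term] += count
    return merged.most_common(size)


print(merged_top(shards, 1))  # "a" is never any shard's top term, so it is missing entirely
print(merged_top(shards, 2))  # with a larger size every shard reports "a" and it jumps to the top
```

In this toy data "a" has 12 hits overall, yet a size of 1 reports "b" with 6 as the winner, mirroring how the counts and the ranking in the question shift as size grows.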
