Order by random in RethinkDB - rethinkdb

I want to order documents randomly in RethinkDB. The reason for this is that I return n groups of documents and each group must appear in order in the results (so all documents belonging to a group should be placed together); and I need to randomly pick a document, belonging to the first group in the results (you don't know which is the first group in the results - the first ones could be empty, so no documents are retrieved for them).
The solution I found to this is to randomly order each of the groups before concat-ing to the result, then always pick the first document from the results (as it will be random). But I'm having a hard time ordering these groups randomly. Would appreciate any hint or even a better solution if there is one.

If you want to order a selection of documents randomly you can just use .orderBy and return a random number using r.random.
r.db('test').table('table')
.orderBy(function (row) { return r.random(); })
If these document are in a group and you want to randomize them inside the group, you can just call orderBy after the group statement.
r.db('test').table('table')
.groupBy('property')
.orderBy(function (row) { return r.random(); })
If you want to randomize the order of the groups, you can just call orderBy after calling .ungroup
r.db('test').table('table')
.groupBy('property')
.ungroup()
.orderBy(function (row) { return r.random(); })

The accepted answer here should not be possible, as John mentioned the sorting function must be deterministic, which r.random() is not.
The r.sample() function could be used to return a random order of the elements:
If the sequence has less than the requested number of elements (i.e., calling sample(10) on a sequence with only five elements), sample will return the entire sequence in a random order.
So, count the number of elements you have, and set that number as the sample number, and you'll get a randomized response.
Example:
var res = r.db("population").table("europeans")
.filter(function(row) {
return row('age').gt(18)
});
var num = res.count();
res.sample(num)

I'm not getting this to work. I tried to sort an table randomly and I'm getting the following error:
e: Sorting by a non-deterministic function is not supported in:
r.db("db").table("table").orderBy(function(var_33) { return r.random(); })
Also I have read in the rethink documentation that this is not supported. This is from the rethinkdb orderBy documentation:
Sorting functions passed to orderBy must be deterministic. You cannot, for instance, order rows using the random command. Using a non-deterministic function with orderBy will raise a ReqlQueryLogicError.
Any suggestions on how to get this to work?

One simple solution would be to give each document a random number:
r.db('db').table('table')
.merge(doc => ({
random: r.random(1, 10)
})
.orderBy('random')

Related

How to get dynamic field count in dc.js numberDisplay?

I'm currently trying to figure out how to get a count of unique records to display using DJ.js and D3.js
The data set looks like this:
id,name,artists,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
6DCZcSspjsKoFjzjrWoCd,God's Plan,Drake,Hip-Hop/Rap,0.754,0.449,7,-9.211,1,0.109,0.0332,8.29E-05,0.552,0.357,77.169,198973,4
3ee8Jmje8o58CHK66QrVC,SAD!,XXXTENTACION,Hip-Hop/Rap,0.74,0.613,8,-4.88,1,0.145,0.258,0.00372,0.123,0.473,75.023,166606,4
There are 100 records in the data set, and I would expect the count to display 70 for the count of unique artists.
var ndx = crossfilter(spotifyData);
totalArtists(ndx);
....
function totalArtists(ndx) {
// Select the artists
var totalArtistsND = dc.numberDisplay("#unique-artists");
// Count them
var dim = ndx.dimension(dc.pluck("artists"));
var uniqueArtist = dim.groupAll();
totalArtistsND.group(uniqueArtist).valueAccessor(x => x);
totalArtistsND.render();
}
I am only getting 100 as a result when I should be getting 70.
Thanks a million, any help would be appreciated
You are on the right track - a groupAll object is usually the right kind of object to use with dc.numberDisplay.
However, dimension.groupAll doesn't use the dimension's key function. Like any groupAll, it looks at all the records and returns one value; the only difference between dimension.groupAll() and crossfilter.groupAll() is that the former does not observe the dimension's filters while the latter observes all filters.
If you were going to use dimension.groupAll, you'd have to write reduce functions that watch the rows as they are added and removed, and keeps a count of how many unique artists it has seen. Sounds kind of tedious and possibly buggy.
Instead, we can write a "fake groupAll", an object whose .value() method returns a value dynamically computed according to the current filters.
The ordinary group object already has a unique count: the number of bins. So we can create a fake groupAll which wraps an ordinary group and returns the length of the array returned by group.all():
function unique_count_groupall(group) {
return {
value: function() {
return group.all().filter(kv => kv.value).length;
}
};
}
Note that we also have to filter out any bins of value zero before counting.
Use the fake groupAll like this:
var uniqueArtist = unique_count_groupall(dim.group());
Demo fiddle.
I just added this to the FAQ.

Plotting aggregated data with sub-columns in dc.js

I have data in the form:
data = [..., {id:X,..., turnover:[[2015,2017,2018],[2000000,3000000,2800000]]}, ...];
My goal is to plot the year in the x-axis, against the average turnover for all companies currently selected via crossfilter in the y-axis.
The years recorded per company are inconsistent, but there should always be three years.
If it would help, I can reorganise the data to be in the form:
data = [..., {id:X,..., turnover:{2015:2000000, 2017:3000000, 2018:2800000}}, ...];
Had I been able to reorganise the data further to look like:
[...{id:X, ..., year:2015, turnover:2000000},{id:X,...,year:2017,turnover:3000000},{id:X,...,year:2018,turnover:2800000}];
Then this question would provide a solution.
But splitting the companies into separate rows doesn't make sense with everything else I'm doing.
Unless I'm mistaken, you have what I call a "tag dimension", aka a dimension with array keys.
You want each row to be recorded once for each year it contains, but you only want it to affect this dimension. You don't want to observe the row multiple times in the other dimensions, which is why you don't want to flatten.
With your original data format, your dimension definition would look something like:
var yearsDimension = cf.dimension(d => d.turnover[0], true);
The key function for a tag dimension should return an array, here of years.
This feature is still fairly new, as crossfilter goes, and a couple of minor bugs were found this year. These bugs should be easy to avoid. The feature has gotten a lot of use and no major bugs have been found.
Always beware with tag dimensions, since any aggregations will add up to more than 100% - in your case 300%. But if you are doing averages across companies for a year, this should not be a problem.
pairs of tags and values
What's unique about your problem is that you not only have multiple keys per row, you also have multiple values associated with those keys.
Although the crossfilter tag dimension feature is handy, it gives you no way to know which tag you are looking at when you reduce. Further, the most powerful and general group reduction method, group.reduce(), doesn't tell you which key you are reducing..
But there is one even more powerful way to reduce across the entire crossfilter at once: dimension.groupAll()
A groupAll object behaves like a group, except that it is fed all of the rows, and it returns only one bin. If you use dimension.groupAll() you get a groupAll object that observes all filters except those on that dimension. You can also use crossfilter.groupAll if you want a groupAll that observes all filters.
Here is a solution (using ES6 syntax for brevity) of reduction functions for groupAll.reduce() that reduces all of the rows into an object of year => {count, total}.
function avg_paired_tag_reduction(idTag, valTag) {
return {
add(p, v) {
v[idTag].forEach((id, i) => {
p[id] = p[id] || {count: 0, total: 0};
++p[id].count;
p[id].total += v[valTag][i];
});
return p;
},
remove(p, v) {
v[idTag].forEach((id, i) => {
console.assert(p[id]);
--p[id].count;
p[id].total -= v[valTag][i];
})
return p;
},
init() {
return {};
}
};
}
It will be fed every row and it will loop over the keys and values in the row, producing a count and total for every key. It assumes that the length of the key array and the value array are the same.
Then we can use a "fake group" to turn the object on demand into the array of {key,value} pairs that dc.js charts expect:
function groupall_map_to_group(groupAll) {
return {
all() {
return Object.entries(groupAll.value())
.map(([key, value]) => ({key,value}));
}
};
}
Use these functions like this:
const red = avg_paired_tag_reduction('id', 'val');
const avgPairedTagGroup = turnoverYearsDim.groupAll().reduce(
red.add, red.remove, red.init
);
console.log(groupall_map_to_group(avgPairedTagGroup).all());
Although it's possible to compute a running average, it's more efficient to instead calculate a count and total, as above, and then tell the chart how to compute the average in the value accessor:
chart.dimension(turnoverYearsDim)
.group(groupall_map_to_group(avgPairedTagGroup))
.valueAccessor(kv => kv.value.total / kv.value.count)
Demo fiddle.

Neo4j Query Performance Issue

I have some performance issue using a specific Cypher Command.
I look for R nodes not directly connected to a specific set of nodes of type I (Here, nodes with index field at "79" and "4") and I want to maximize the field "score" :
MATCH (r:R), (i0:I { index:"79" }), (i1:I { index:"4" })
WHERE NOT r--i0 AND NOT r--i1
RETURN r.index
ORDER BY r.score DESC
LIMIT 5
The query is executed generally in 1250ms.
If I remove the ORDER BY clause, the request time goes down to 130ms.
The order clause iterates on nearly 3300 elements.
Any idea how I can speed up that request ? I am sure there is a way to use another syntax to perform this search.
I think it is normal, by removing the ORDER BY, he will return you the 5 first nodes he can match.
By adding the ORDER BY, it forces to load all possible matching nodes, depending of the amount of "R" nodes the time will increase.
Now :
Did you "profiled" your query with PROFILE
do you have indexes/constraints on I:index ?
Can you change slightly your query to :
MATCH (r:R), (i0:I { index:"79" }), (i1:I { index:"4" })
WHERE NOT EXISTS((r)--(i0))
AND NOT EXISTS((r)--(i1))
RETURN r.index
ORDER BY r.score DESC
LIMIT 5
Which version do you use? try to update to the latest one, also please share your visual query plan by prefixing your query with `PROFILE``
Change it to:
MATCH (i0:I { index:"79" }), (i1:I { index:"4" })
MATCH (r:R)
WHERE NOT r--i0 AND NOT r--i1
WITH r
ORDER BY r.score DESC
LIMIT 5
RETURN r.index

How can I descending sort a grouping based on the count of the reduction array in rethinkdb

Importing this dataset as a table:
https://data.cityofnewyork.us/Housing-Development/Registration-Contacts/feu5-w2e2#revert
I use the following query to perform an aggregation and then attempt to sort in descending order based on the reduction field. My intention is sort based on the count of that field or to have the aggregation create a second field called count and sort the grouping results in descending order of the reduction array count or length. How can this be done in rethinkdb?
query:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"}).hasFields("CorporationName").group("CorporationName").ungroup().orderBy(r.desc('reduction'))
I don't quite understand what you're going for, but does this do what you want? If not, what do you want to be different in the output?
r.table("contacts")
.filter({"Type": "Agent","ContactDescription" : "CONDO"})
.hasFields("CorporationName")
.group("CorporationName")
.ungroup()
.merge(function(row){ return {count: row('reduction').count()}; })
.orderBy(r.desc('count'))
You are almost there:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"}).hasFields("CorporationName").group("CorporationName").count().ungroup().orderBy(r.desc('reduction'))
See that .count()? That is a map-reduce operation to get the count of each group.
I haven't tested the query on your dataset. Please comment in case you had problems with it.
EDIT:
If you want to add a count field and preserve the original document, you need to use map and reduce. In your case, it should be something like:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"})
.hasFields("CorporationName")
.group("CorporationName")
.map(r.row.merge({count:1}))
.reduce(function(left, right){
return {
count: left('count').add(right('count')),
<YOUR_OTHER_FIELDS>: left('<YOUR_OTHER_FIELDS>'),
...
};
})
.ungroup().orderBy(r.desc(r.row('reduction')('count')))
EDIT:
I am not sure if this can do the trick, but it is worth a try:
.reduce(function(left, right){
return left.merge({count: left('count').add(right('count'))})
})

Size of a filtered dimension in Crossfilter?

I've read through the Crossfilter API docs several times but can't see how to do the following.
Suppose I have set up
crossfilter(event);
and a dimension foo:
var foo = event.dimension(function(d) { return d.foo; }),
foos = foo.group(function(d) { return Math.floor(d) ; });
Then, before any filters are applied, event.size() will give me the number of records in the event, and foos.size() will give me the number of distinct records in the foo dimension
Great! Now I apply some filters by sliding brushes around. event.groupAll().value() now gives me the current number of records in event that are selected. Great again.
Now how do I get the current number of distinct records in the foo dimension? I've tried many different combinations of the API primitives, but none seem to work.
Any ideas?
This should do the trick
var n = foo.top(Number.POSITIVE_INFINITY).length;
I do not have enough reputation to comment the solution proposed by Reno.
This should do the trick
var n = foo.top(Number.POSITIVE_INFINITY).length;
The problem of this solution is that is not efficient, because top function is ordering the data.
I have the same problem that you and I have a counter in the filter to know how many entries have the dimension.

Resources