RethinkDB: count unique rows within grouped data

I'm trying to count all unique rows within grouped data, i.e., how many unique rows exist within each group.
Although groupedData.distinct().count() works for relatively small numbers of rows, running it on ~200k rows, as in my case, ends with an "over size limit" error.
I understand why it happens, yet I can't come up with a more efficient way of doing it. Is there one?

In my experience, count is an expensive operation in RethinkDB, especially when it has to iterate over the whole data set. I struggled with this myself for a while.
As I understand it, when you pass grouped data to distinct, it creates an array, because the grouped data is a sequence, and so it hits the 100,000-element array limit.
To solve this, I think we have to use a stream and count the stream instead. We cannot use group because, again to my understanding, it returns a group of streams, or in other words an array of streams.
So here is how I solved it:
1) Create an index on the field I want to group by.
2) Call distinct on that table with the index.
3) Map the stream, passing each value into a count of getAll with the same index.
An example query
r.table('t').distinct({index: 'index_name'})
  .map(function(value) {
    return {group: value, total: r.table('t').getAll(value, {index: 'index_name'}).count()};
  })
With this, everything is a stream and we can lazily iterate over the result set to get the count of each group.
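For completeness, here is a minimal end-to-end sketch with the official JavaScript driver; the table name 't', the grouping field 'field_name', and the index name 'index_name' are placeholders, and the index creation is a one-time setup step.

var r = require('rethinkdb');

r.connect({host: 'localhost', port: 28015}, function (err, conn) {
  if (err) throw err;
  // One-time setup: index the field we group by (placeholder names).
  r.table('t').indexCreate('index_name', r.row('field_name')).run(conn, function () {
    r.table('t').indexWait('index_name').run(conn, function () {
      // distinct on the index yields a stream of unique group values;
      // each value is then counted via getAll on the same index.
      r.table('t')
        .distinct({index: 'index_name'})
        .map(function (value) {
          return {group: value, total: r.table('t').getAll(value, {index: 'index_name'}).count()};
        })
        .run(conn, function (err, cursor) {
          if (err) throw err;
          cursor.toArray(function (err, results) {
            console.log(results); // e.g. [{group: 'a', total: 1234}, ...]
          });
        });
    });
  });
});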

Related

bigtable.SampleRowKeys return a single key

I'm trying to write code that does a full table scan in Go by using the bigtable.Table.SampleRowKeys RPC method. The table has around 7m rows (verified with cbt), yet the call returns a single key, whereas the documentation mentions:
// SampleRowKeys returns a sample of row keys in the table. The returned row keys will delimit contiguous sections of
// the table of approximately equal size, which can be used to break up the data for distributed tasks like MapReduce.
Am I missing something?
It turns out that the returned keys are midpoints, so if the call returns a single key, say [k1], then the ranges are [("", k1), (k1, "")].
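To make the range construction concrete, here is a small sketch in plain JavaScript (no Bigtable client calls) that turns whatever list of sampled keys comes back, even a single one, into the full set of contiguous ranges:

function rangesFromSampleKeys(sampleKeys) {
  // Empty string stands for the open start/end of the table.
  var ranges = [];
  var start = '';
  sampleKeys.forEach(function (key) {
    ranges.push([start, key]);
    start = key;
  });
  ranges.push([start, '']); // tail range from the last sampled key to the end
  return ranges;
}

// A single returned key k1 yields [['', 'k1'], ['k1', '']].
console.log(rangesFromSampleKeys(['k1']));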

dataloader facebook optimization

Hi everybody!
I'm trying to use Facebook's DataLoader in my GraphQL project.
Now I'm facing the following problem. When I query my database for data by ids, for example select * from books where books.author in (4,5,6,7), I get an error: "function did not return a Promise of an Array of the same length as the Array of keys". That's because for id 4 I can fetch more than just one book.
Does anybody know how to fix it?
Dataloader is expecting you to return an array of the same length as the input to your loader. So, if the loader gets [4,5,6,7] as an input, it will need to return an array with a length of 4. Also keep in mind that the results returned from the loader need to be in the same order as the input ids. This may or may not be something you have to worry about depending on how the data is returned from your database.
You should return an array for each id, i.e. an array of arrays. You have to convert the SQL result, a flat list with duplicates, into 'grouped' arrays of records, preserving the input ids (count and order).
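As a sketch, the batch function could look like the following; db.query is a hypothetical helper that resolves to the flat list of book rows, and the author column name is an assumption:

const DataLoader = require('dataloader');

const booksByAuthorLoader = new DataLoader(async (authorIds) => {
  // Hypothetical query helper; returns the flat list, with several rows per author.
  const rows = await db.query('select * from books where books.author in (?)', [authorIds]);

  // Group the flat result set by author id.
  const byAuthor = new Map();
  for (const row of rows) {
    if (!byAuthor.has(row.author)) byAuthor.set(row.author, []);
    byAuthor.get(row.author).push(row);
  }

  // One array per input key, in the same order as the keys;
  // authors with no books get an empty array.
  return authorIds.map((id) => byAuthor.get(id) || []);
});

// Usage: booksByAuthorLoader.load(4) resolves to the array of books for author 4.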

couchdb - retrieve unique documents for a view that emits non-unique two array keys

I have a map function in a CouchDB view that emits non-unique two-element array keys for documents of type message.
The first position in the array key is a user_id, the second position represents whether or not the user has read the message.
This works nicely in that I can set include_docs=true and retrieve the actual documents. However, I'm retrieving duplicate documents in that case, as you can see above in the view results. I need to be able to write a view that can be queried to return unique messages that have been read by a given user. Additionally, I need to be able to efficiently paginate the resultset.
Notice in the image above that [66, true] is emitted twice for doc id 26a9a271de3aac494d37b17334aaf7f3. As far as I can tell, with the keys in my map function, I cannot reduce in such a way that unique documents will be returned.
The second idea I had was to also emit doc._id in the map function and reduce with group_level=exact, the result being:
Now I am able to get unique document ids, but I cannot get the documents without doing a second query. And even with a second query, it would require a lot of complexity to do pagination like this (at least I think so).
The third and last idea I came up with was to emit the entire document, rather than the doc._id, in the third position of the array key; then I can access the entire document and likely paginate. This seems really brutish.
So my question is:
Is #3 above a terrible idea? Is there something I'm missing? Is there a better approach?
Thanks in advance.
See @WickedGrey's comment on the question. The solution is to ensure that I never emit the same key twice for one document. I do this in the map function by keeping track of the keys as I emit them in an array, and skipping the emit if the key already exists in that array.
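A sketch of what such a map function might look like; the doc.recipients array and its user_id/read fields are assumptions about the document shape, since the original map function isn't shown:

function (doc) {
  if (doc.type === 'message') {
    var emitted = [];
    // Assumed document shape: one entry per recipient with a user_id and a read flag.
    doc.recipients.forEach(function (recipient) {
      var key = [recipient.user_id, recipient.read];
      var keyString = JSON.stringify(key);
      // Skip the emit if this exact key was already emitted for this document.
      if (emitted.indexOf(keyString) === -1) {
        emitted.push(keyString);
        emit(key, null);
      }
    });
  }
}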

Salesforce SOQL query length and efficiency

I am trying to solve the problem of deleting only rows matching two criteria, each being a list of ids. These ids come in pairs: if the item to be deleted has one, it must have the second one in the pair, so just using two IN clauses will not work. I have come up with two solutions.
1) Use the two IN clauses, but then loop over the items and check that the two ids in question appear in the correct pairing.
I.e.:
for (Object__c obj : [SELECT Id, Relation1__c, Relation2__c FROM Object__c WHERE Relation1__c IN :idlist1 AND Relation2__c IN :idlist2]) {
    if (preConstructedPairingsAsString.contains('' + obj.Relation1__c + obj.Relation2__c)) {
        listToDelete.add(obj);
    }
}
2) Loop over the ids and build an admittedly long query.
I like the second choice because I only get the items I need and can just pass the list to delete, but I know that Salesforce has hang-ups with long SOQL queries. Is there a penalty to the second option? Is it better to build and run a long query string, or to fetch more objects than necessary and filter?
In general you want to put as much logic as you can into SOQL queries, because that won't use any script statements and they execute faster than your code. However, there is a 10,000-character limit on SOQL queries (it can be raised to 20,000), so by my back-of-the-envelope calculation (two 18-character ids plus quoting and a comma is roughly 40 characters per pair) you'd only be able to fit around 250 id pairs before hitting that limit.
I would go with option 1 or if you really care about efficiency you can create a formula field on the object that pairs the ids and filter on that.
formula: relation1__c + '-' + relation2__c
for (List<Object__c> objs : [SELECT Id FROM Object__c WHERE formula__c IN :idpairs]) {
    delete objs;
}

Windows Azure Paging Large Datasets Solution

I'm using Windows Azure Table Storage to store millions of entities; however, I'm trying to figure out the best solution that easily allows for three things:
1) a search for an entity will retrieve that entity and at least (pageSize) entities either side of it
2) if there are more entities beyond (pageSize) entities either side of it, then next-page or previous-page links are shown; this continues until either the start or the end is reached
3) the order is reverse chronological
I've decided that the PartitionKey will be the Title provided by the user, as each container is unique in the system. The RowKey is built with Steve Marx's lexicographical algorithm:
http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure
which when converted to javascript instead of c# looks like this:
pad(new Date(100000000 * 86400000).getTime() - new Date().getTime(), 19) + "_" + uuid()
uuid() is a JavaScript function that returns a GUID, and pad left-pads with zeros up to 19 characters in length. So records in the system look something like this:
PK RK
TEST 0008638662595845431_ecf134e4-b10d-47e8-91f2-4de9c4d64388
TEST 0008638662595845432_ae7bb505-8594-43bc-80b7-6bd34bb9541b
TEST 0008638662595845433_d527d215-03a5-4e46-8a54-10027b8e23f8
TEST 0008638662595845434_a2ebc3f4-67fe-43e2-becd-eaa41a4132e2
This pattern allows for every new entity inserted to be at the top of the list which satisfies point number 3 above.
With a nice way of adding new records to the system, I thought I would then create a mechanism that looks at the first half of the RowKey, i.e. the 0008638662595845431_ part, and does a greater-than or less-than comparison depending on the direction relative to the already-found item. In other words, to get the row immediately before 0008638662595845431 I would do a query like this:
var azure = require('azure');

var tableService = azure.createTableService();
var minPossibleDateTimeNumber = pad(new Date(-100000000 * 86400000).getTime() - new Date().getTime(), 19);

tableService.getTable('testTable', function (error) {
    if (error === null) {
        var query = azure.TableQuery
            .select()
            .from('testTable')
            .where('PartitionKey eq ?', 'TEST')
            .and('RowKey gt ?', minPossibleDateTimeNumber + '_')
            .and('RowKey lt ?', '0008638662595845431_')
            .and('Deleted eq ?', 'false');

        tableService.queryEntities(query, function (error, entities, continuationToken) {
            // handle the page of entities and any continuation token here
        });
    }
});
If the results returned exceed 1,000 and Azure gives me a continuation token, I thought I would remember the last item's RowKey, i.e. the number part 0008638662595845431. The next query would then use that remembered value as the starting value, and so on.
I am using the Windows Azure Node.js SDK and the language is JavaScript.
Can anybody see gotchas or problems with this approach?
I do not see how this can work effectively and efficiently, especially to get the rows for a previous page.
To be efficient, the prefix of your “key” needs to be a serially incrementing or decrementing value, instead of being based on a timestamp. A timestamp generated value would have duplicates as well as holes, making mapping page size to row count at best inefficient and at worst difficult to determine.
Also, this potential algorithm is dependent on a single partition key, destroying table scalability.
The challenge here would be to have a method of generating a serially incremented key. One solution is to use a SQL database and performing an atomic update on a single row, such that an incrementing or decrementing value is produced in sequence. Something like UPDATE … SET X = X + 1 and return X. Maybe using a stored procedure.
So the key could be a zero left padded serially generated number. Split such that say the first N digits of the number is the partition key and remaining M digits are the row key.
For example
PKey RKey
00001 10321
00001 10322
….
00954 98912
Now, since the rows are in sequence it is possible to write a query with the exact key range for the page size.
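A hedged sketch of that split and of the exact key range for a page, assuming 5 digits for the partition prefix and 5 digits for the row suffix (the widths in the example above are placeholders):

// Split a serially generated number into a PartitionKey/RowKey pair,
// zero left-padded to 10 digits: first 5 digits partition, last 5 row.
function keysForSerial(serial) {
  var padded = String(serial);
  while (padded.length < 10) padded = '0' + padded;
  return {partitionKey: padded.slice(0, 5), rowKey: padded.slice(5)};
}

// Exact key range for one page: rows [firstSerial, firstSerial + pageSize).
function pageRange(firstSerial, pageSize) {
  return {from: keysForSerial(firstSerial), to: keysForSerial(firstSerial + pageSize)};
}

// Example: a 50-row page starting at serial 110321 spans
// {PK '00001', RK '10321'} up to but excluding {PK '00001', RK '10371'}.
console.log(pageRange(110321, 50));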
Caveat: there is a small risk of a failure occurring between generating a serial key and writing to table storage, in which case there may be holes in the table. However, your paging algorithm should be able to detect and work around such instances quite easily, by specifying a page size slightly larger than necessary or by retrying with an adjusted range.
