couchdb - retrieve unique documents for a view that emits non-unique two array keys - filter

I have a map function in a CouchDB view that emits non-unique, two-element array keys for documents of type message.
The first position in the array key is a user_id; the second position represents whether or not the user has read the message.
This works nicely in that I can set include_docs=true and retrieve the actual documents. However, that returns duplicate documents, as you can see above in the view results. I need a view that can be queried to return the unique messages that have been read by a given user. I also need to paginate the result set efficiently.
Notice in the image above that [66, true] is emitted twice for doc id 26a9a271de3aac494d37b17334aaf7f3. As far as I can tell, with the keys my map function emits, I cannot reduce in a way that returns unique documents.
The next idea I had was to also emit doc._id in the map function and reduce with group_level=exact, the result being:
Now I can get unique document ids, but I cannot get the documents without a second query. Even with a second query, pagination would get quite complex (at least I think so).
The last idea I came up with is to emit the entire document, rather than doc._id, in the third position of the array key; then I can access the entire document and likely paginate. This seems really brutish.
So my question is:
Is #3 above a terrible idea? Is there something I'm missing? Is there a better approach?
Thanks in advance.

See #WickedGrey's comment on the question. The solution is to ensure that the same key is never emitted twice for one document. I do this in the map function by recording each key in an array as I emit it, and skipping the emit if the key is already in the array.
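A minimal sketch of that de-duplication, under assumptions: the original map function is not shown, so the document shape (a recipients array of { user_id, read } entries) and the names mapMessage and runMap are hypothetical. emit is passed in as a parameter only so the sketch runs outside CouchDB; in a real view it is a global.

```javascript
// Hypothetical document shape: { type: 'message', recipients: [{ user_id, read }, ...] }.
function mapMessage(doc, emit) {
  if (doc.type !== 'message') return;
  var seen = []; // string forms of the keys already emitted for this document
  (doc.recipients || []).forEach(function (r) {
    var key = [r.user_id, r.read];
    var id = JSON.stringify(key); // arrays compare by reference, so compare a string form
    if (seen.indexOf(id) !== -1) return; // duplicate key: skip the emit
    seen.push(id);
    emit(key, null);
  });
}

// Stand-in for CouchDB's emit that simply collects the rows.
function runMap(doc) {
  var rows = [];
  mapMessage(doc, function (key, value) {
    rows.push({ key: key, value: value });
  });
  return rows;
}
```

Because each document now contributes a given key at most once, querying the view with include_docs=true returns each message once per key.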

Related

Redis : Get all keys by providing one of the value in the values list

In Redis I'm planning to store each key as a unique string, with a list as its value.
I have a use case where I need to do 2 things.
First, I need to get all the values associated with a key by providing the key as input.
Second, I want to get all the keys associated with a value by providing one of the value in the values list.
The second part is where I need advice: how can we achieve this?
I cannot get all the keys or key value pair and loop through because I will have millions of entries in Redis.
As mentioned in the comment above, retrieving all keys with their associated values will likely cause performance problems, since it means running through a large number of entries. As the official documentation on retrieving data also suggests, you can try the following Redis commands to get the values and see whether they serve your purpose:
GET
MGET
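For the second requirement (finding keys by one of their values), a commonly used approach that the answer above does not mention is to maintain a reverse index at write time. The sketch below models the idea with in-memory Maps and uses no Redis client; createIndex, push, valuesFor and keysFor are hypothetical names. In Redis itself the forward structure could be a list (RPUSH / LRANGE) and the reverse structure a set per value (SADD / SMEMBERS).

```javascript
// In-memory stand-in for two Redis structures:
//   forward: key   -> list of values
//   reverse: value -> set of keys that contain that value
function createIndex() {
  var forward = new Map();
  var reverse = new Map();
  return {
    // Append one value to a key's list and record the key under that value.
    push: function (key, value) {
      if (!forward.has(key)) forward.set(key, []);
      forward.get(key).push(value);
      if (!reverse.has(value)) reverse.set(value, new Set());
      reverse.get(value).add(key);
    },
    // First use case: all values for a key.
    valuesFor: function (key) {
      return (forward.get(key) || []).slice();
    },
    // Second use case: all keys whose list contains the value, without scanning every key.
    keysFor: function (value) {
      return Array.from(reverse.get(value) || []).sort();
    }
  };
}
```

The trade-off is extra memory and a second write per value, in exchange for a lookup that never iterates over the whole keyspace.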

dataloader facebook optimization

Hi everybody!
I'm trying to use Facebook's DataLoader in my GraphQL project.
I've now run into the following problem. When I query my database by ids, for example: select * from books where books.author in (4,5,6,7), I get an Error: "function did not return a Promise of an Array of the same length as the Array of keys", because id 4 can match more than one book.
Does anybody know how to fix it?
Dataloader is expecting you to return an array of the same length as the input to your loader. So, if the loader gets [4,5,6,7] as an input, it will need to return an array with a length of 4. Also keep in mind that the results returned from the loader need to be in the same order as the input ids. This may or may not be something you have to worry about depending on how the data is returned from your database.
You should return an array for each id, i.e. an array of arrays. You have to convert the SQL result (a flat list in which an author id can repeat) into grouped arrays of records, one per input id, preserving the number and order of the input ids.
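A minimal sketch of that grouping step, assuming each row carries the author id it was matched on. groupRowsByKeys is a hypothetical helper name, and queryBooks in the trailing comment is a hypothetical function returning a Promise of rows.

```javascript
// Turn a flat list of rows into one array per input key,
// in the same order and with the same length as `keys`.
function groupRowsByKeys(keys, rows, keyOf) {
  var byKey = new Map();
  keys.forEach(function (k) {
    byKey.set(k, []); // every key gets a bucket, even if no row matches it
  });
  rows.forEach(function (row) {
    var bucket = byKey.get(keyOf(row));
    if (bucket) bucket.push(row);
  });
  return keys.map(function (k) {
    return byKey.get(k);
  });
}

// Possible use inside a batch function (queryBooks is an assumption):
// new DataLoader(function (keys) {
//   return queryBooks(keys).then(function (rows) {
//     return groupRowsByKeys(keys, rows, function (r) { return r.author; });
//   });
// });
```

Keys with no matching rows resolve to an empty array, so the result length always equals the number of keys.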

Extract a column from crossfilter

Here's my problem. I have a working dc.js-based dashboard with some data in it. One column of the data contains text (Twitter info). Is it somehow possible to extract that specific column from crossfilter? My aim is to create some charts and have the crossfilter text data feed a d3-based word cloud, so that I can also do the drill-down filtering that dc and crossfilter provide out of the box. I tried dimension.top(Infinity), but that returns all the key-value pairs in the data. I just need the values for one particular key across the whole data set. I hope my question makes some sense.
EDIT:
More research reveals that the word cloud accepts data as key-value pairs, where the key is the word and the value is its frequency of appearance. So I am guessing that will need to be implemented as well. If there is a ready-to-use library out there, kindly let me know. This changes things a bit as far as crossfilter is concerned: I need to recompute this key-value data (fit for word cloud consumption) whenever a filter is triggered. How should I go about it?
Looking forward to hearing from you all.
Best,
Anmol
Answer to the first part of the question: Probably dimension.top(Infinity) and then use an accessor to get the values you need. Not exactly efficient, but it is what it is.
Answer to the 2nd part of the question:
You need groupAll, I think. You want to take a tweet, generate an array of tokens (words), then generate a Crossfilter grouping that is a count per word, right? You can code your own custom crossfilter.dimension.groupAll reducers (if you want to do that, create a working example and I can probably cook it up). Or if you want to use Reductio:
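// Dimension whose value is the array of words in each tweet
var tweetWords = data.dimension(function(d) { return d.tweetText.split(' '); });
var wordCounts = tweetWords.groupAll();
// Reductio reducer that counts records once per word
var reducer = reductio()
    .groupAll(function(d) {
        return d.tweetText.split(' ');
    })
    .count(true);
reducer(wordCounts);
wordCounts.all();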
If you want to filter on this dimension you'll have to override the filter handler and check if the group key is in the dimension array for the record using a filterFunction.
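As a plain JavaScript illustration of what both parts of the answer amount to, the sketch below takes an array of records, such as the one dimension.top(Infinity) returns for the records passing the current filters, extracts one column through an accessor, and counts words into the key/value shape the question describes for the word cloud. It uses neither Crossfilter nor Reductio; wordFrequencies is a hypothetical helper name and tweetText is the field name used in the answer above.

```javascript
// Count how often each word occurs in one text column of the records.
// Returns [{ key: word, value: count }, ...] in first-seen order.
function wordFrequencies(records, accessor) {
  var counts = new Map();
  records.forEach(function (d) {
    accessor(d).split(/\s+/).forEach(function (word) {
      if (word === '') return; // ignore empty tokens from leading/trailing spaces
      counts.set(word, (counts.get(word) || 0) + 1);
    });
  });
  return Array.from(counts, function (entry) {
    return { key: entry[0], value: entry[1] };
  });
}
```

Recomputing this over the full filtered record set on every filter change is simple but not incremental, which is the inefficiency the first part of the answer concedes.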

RethinkDB: count unique rows within grouped data

I'm trying to count all unique rows within grouped data, i.e, how many unique rows exist within each group.
Although groupedData.distinct().count() works for relatively small amounts of rows, running it on ~200k rows, such as in my case, ends with "over size limit".
I understand why it happens, yet I can't come up with a more efficient way of doing it. Is there one?
In my experience, count is an expensive operation in RethinkDB, especially a count that requires iterating over the whole data set. I struggled with this myself for a while.
As I understand it, when you pass the grouped data to distinct, it builds an array, because the grouped data is a sequence; it is therefore subject to the 100,000-element limit.
To solve this, I think we have to use a stream and count the stream instead. We cannot use group because, again as I understand it, it returns grouped streams, in other words an array of streams.
So here is how I solve it:
Create an index on the field I want to group by.
Call distinct on the table with that index.
Map over the stream, passing each value to a count built with getAll on the same index.
An example query:
r.table('t').distinct({index: 'index_name'})
  .map(function(value) {
    return {group: value, total: r.table('t').getAll(value, {index: 'index_name'}).count()};
  })
With this, everything is a stream, and we can lazily iterate over the result set to get the count for each group.

CouchDb filter and sort in one view

I'm new to CouchDB.
I have to filter records by date (the date must be between two values) and sort the data by name, by date, etc. (depending on the user's selection in the table).
In MySQL it looks like:
SELECT * FROM table WHERE date > "2015-01-01" AND date < "2015-08-01" ORDER BY name/date/email ASC/DESC
I can't figure out whether I can use one view for all of these cases.
Here is my map example:
function(doc) {
  emit(
    [doc.date, doc.name, doc.email],
    {
      email: doc.email,
      name: doc.name,
      date: doc.date
    }
  );
}
I tried to filter the data using startkey and endkey, but I'm not sure how to sort the data this way:
startkey=["2015-01-01"]&endkey=["2015-08-01"]
Can I use one view? Or do I have to create several views whose key order depends on the current sort field: [doc.date, doc.name, doc.email], [doc.name, doc.date, doc.email] etc?
Thanks for your help!
As Sebastian said you need to use a list function to do this in Couch.
If you think about it, this is what MySQL is doing. Its query optimizer will pick an index into your table, it will scan a range from that index, load what it needs into memory, and execute query logic.
In Couch the view is your B-tree index, and a list function can implement whatever logic you need. It can be used to spit out HTML instead of JSON, but it can also be used to filter/sort the output of your view, and still spit out JSON in the end. It might not scale very well to millions of documents, but MySQL might not either.
So your options are the ones Sebastian highlighted:
1.) The view sorts by date, the query selects the date range, and a list function loads everything into memory and sorts by email/etc.
2.) The views sort by email/etc, and a list function filters out everything outside the date range.
Which one you choose depends on your data and architecture.
With option 1 you may skip the list function entirely: get all the necessary data from the view in one go (with include_docs), and sort client side. This is how you'll typically use Couch.
If you need this done server side, your list function has to load every matching document into an array, then sort it and serialize it to JSON. This obviously falls apart if there are so many matching documents that they don't fit into memory or take too long to sort.
Option 2 scans through preordered documents and only sends those matching the dates. Done right, this avoids loading everything into memory. On the other hand, it might scan far too many documents, thrashing your disk I/O.
If the date range is "very discriminating" (few documents pass the test) option 1 works best; otherwise (most documents pass) option 2 can be better. Remember that in the time it takes to load a useless document from disk (option 2), you can sort tens of documents in memory, as long as they fit in memory (option 1). Also, the more indexes, the more disk space is used and the more writes are slowed down.
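The two options can be sketched in plain JavaScript operating on view rows, assuming rows shaped like the question's map output (value holds email, name and date, with dates as ISO strings). sortRowsBy and filterByDate are hypothetical helper names; the same logic could run client side or inside a list function.

```javascript
// Option 1: rows are already limited to the date range via startkey/endkey;
// sort them in memory by the chosen field of the emitted value.
function sortRowsBy(rows, field, descending) {
  var sorted = rows.slice().sort(function (a, b) {
    if (a.value[field] < b.value[field]) return -1;
    if (a.value[field] > b.value[field]) return 1;
    return 0;
  });
  return descending ? sorted.reverse() : sorted;
}

// Option 2: rows arrive pre-sorted by another field;
// keep only those strictly inside the date range, preserving their order.
function filterByDate(rows, from, to) {
  return rows.filter(function (row) {
    // ISO-formatted date strings compare correctly as plain strings
    return row.value.date > from && row.value.date < to;
  });
}
```

Option 1 holds the whole matching range in memory at once, while option 2 keeps memory flat but touches every row of the view, which is the trade-off described above.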
You could use a list function for that, in two ways:
1.) The view is ordered by date and you sort by e-mail. Be aware that you'd need ALL items in memory to sort by e-mail (i.e. you can only do this when your result set is small).
2.) The view is ordered by e-mail and a list function drops everything outside the date range (you can only do that when the overall list is small, so this one is most probably bad).
Possibly #1 can help you.
