Sorting by a non-key (arbitrary) field in CouchDB

I have a fairly large CouchDB database (approximately 3 million documents). I have various view functions returning slices of the data that can't be modified (or at least, should only be modified as a last resort).
I need the ability to sort on an arbitrary field for reporting purposes. For smaller DBs, I return the entire object, JSON-parse it in our PHP backend, and sort there. However, we're often getting Out Of Memory errors when doing this on our largest DBs.
After some research, I'm leaning towards accessing a sort key (via URL parameter) in a list function and doing the sort there. This is an idea I've stolen from here. Excerpt:
function(head, req) {
  var row;
  var rows = [];
  while ((row = getRow())) {
    rows.push(row);
  }
  rows.sort(function(a, b) {
    return b.value - a.value;
  });
  send(JSON.stringify({"rows": rows}));
}
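A hedged sketch of the parameterized variant I have in mind: the field to sort on (and the direction) come from the query string. The parameter names "sort" and "dir" are my own assumptions, not anything CouchDB defines.

```javascript
// Comparator factory for sorting view rows by an arbitrary field.
function makeComparator(field, direction) {
  var sign = direction === "desc" ? -1 : 1;
  return function (a, b) {
    var av = a.value[field], bv = b.value[field];
    if (av === bv) return 0;
    return av < bv ? -sign : sign;
  };
}

// Inside a real list function this would be used roughly as:
// function (head, req) {
//   var row, rows = [];
//   while ((row = getRow())) rows.push(row);
//   rows.sort(makeComparator(req.query.sort, req.query.dir));
//   send(JSON.stringify({ rows: rows }));
// }

// Standalone check of the comparator logic:
var rows = [
  { value: { credits: 5 } },
  { value: { credits: 1 } },
  { value: { credits: 3 } }
];
rows.sort(makeComparator("credits", "asc"));
```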
It seems to be working for smaller DBs, but it still needs a lot of work to be production ready.
Is this:
a) a good solution?
b) going to work with 3, 5, or 10 million rows?

You can't avoid loading everything into memory by using a list function. So with enough data, eventually, you'll get an out of memory error, just as you're getting with PHP.
If you can live within the memory constraints, it's a reasonable solution, with some advantages.
Otherwise, investigate using something like lucene, elasticsearch, or Cloudant Search (clouseau & dreyfus).

In our environment, we have more than 5 million records. The database is designed such that each and every document has specific fields which distinguish it from other categories of documents.
For example, there are a number of documents with the field DocumentType set to "User" or DocumentType set to "XXX".
This DocumentType field allows us to sort documents into different categories.
So if you have 3 million docs and around 10 categories, each category will have about 300k docs.
Now you can design the system such that you always pass the doc ID you need to CouchDB. That way it will be faster.
So the map function can look like:
function(doc) {
  if (doc.DocumentType === 'XXX' && doc._id) {
    emit(doc.FieldYouWant, doc._id);
  }
}
This is how our backend is designed in production.
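To make the idea concrete, here is a standalone sketch of that map function with emit() stubbed out (the sample documents and field values are made up):

```javascript
// emit() is a CouchDB built-in inside map functions; it is stubbed
// here so the filtering logic can be exercised on its own.
var emitted = [];
function emit(key, value) { emitted.push({ key: key, id: value }); }

function map(doc) {
  if (doc.DocumentType === 'XXX' && doc._id) {
    emit(doc.FieldYouWant, doc._id);
  }
}

[
  { _id: "a", DocumentType: "XXX",  FieldYouWant: "beta" },
  { _id: "b", DocumentType: "User", FieldYouWant: "alpha" },
  { _id: "c", DocumentType: "XXX",  FieldYouWant: "alpha" }
].forEach(map);
// Only the two 'XXX' documents are indexed; CouchDB keeps the view
// sorted by key, so range queries on FieldYouWant stay cheap.
```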

Related

Elastic Search: Modelling data containing variable fields

I need to store data that can be represented in JSON as follows:
Article {
  "Id": 1,
  "Category": "History",
  "Title": "War stories",
  // Comments could be pretty long and also be changed frequently
  "Comments": "Nice narration, Reminds me of the difficult Times, Tough Decisions",
  "Tags": "truth, reality, history", // Might change frequently
  "UserSpecifiedNotes": [
    // The array may contain different users for different articles
    {
      "userid": 20,
      "note": "Good for work"
    },
    {
      "userid": 22,
      "note": "Homework is due for work"
    }
  ]
}
After having gone through different articles, denormalization of the data is one way to handle this. But since the common fields could be pretty long and may change frequently, I would rather not repeat them. What would be better ways to represent and search this data? Parent-child? Inner object?
Currently, I would be dealing with a lot of inserts, updates and few searches. But whenever search is to be done, it has to be very fast. I am using NEST (.net client) for using elastic search. The search query to be used is expected to work as follows:
Input: searchString and a userID
Behavior: The Articles containing searchString in either the Title, comments, tags, or the note for the given userID, sorted in order of relevance
In a normal scenario the main content of the article will change very rarely, whereas the "UserSpecifiedNotes"/comments against an article will be generated/added more frequently. This is an ideal use case for a parent-child relation.
With an inner object you still have to reindex the whole main article and its "UserSpecifiedNotes"/comments every time a new note comes in. With a parent-child relation you will just be adding a new note.
With the details you have specified, you can take the approach of 4 indices:
Main Article (id, category, title, description etc)
Comments (commented by, comment text etc)
Tags (tags, any other meta tag)
UserSpecifiedNotes (userId, notes)
Having said that, what needs to be kept in mind is your actual requirement. A parent-child relation will need more memory and may slow down search performance a tiny bit, but indexing will be faster.
On the other hand, a nested object will increase your indexing time significantly, since you need to collect all the data related to an article before indexing. You can of course store everything and just add updates. For simpler maintenance and ease of implementation, I would suggest using parent-child.
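For orientation, here is a sketch of the kind of mapping this split implies, written as a plain JavaScript object. The type and field names are assumptions, and the `_parent` syntax shown is the pre-6.x Elasticsearch form, where parent/child was a mapping-level feature:

```javascript
// Pre-6.x Elasticsearch mapping sketch: notes are child documents
// pointing at their article, so adding a note indexes only the note.
var mappings = {
  article: {
    properties: {
      category: { type: "string" },
      title:    { type: "string" }
    }
  },
  user_note: {
    _parent: { type: "article" }, // each note is routed to its article
    properties: {
      userId: { type: "integer" },
      note:   { type: "string" }
    }
  }
};
```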

Limiting Data With Lucene.NET

We are using Sql Server 2012 Full-Text indexing, however we would like to move our database to Sql Azure. Using the migration tool it is telling us that Full-Text indexing is not compatible with Sql Azure (even v12 which is in preview does not support it so it doesn't look like they intend to support it).
Because of this we are looking at alternatives and the best I have found so far is using Lucene.NET with AzureDirectory (https://azuredirectory.codeplex.com). This will allow us to store the index in blob storage and cache it locally on the file system of the VMs which host the web sites (also in Azure).
The issue we have is that the data we intend to index is items such as news stories which are not visible to all users, because of a publishing model we have which limits items to be visible to just a subset of the users. With Full-Text indexing, when searching for a news story we can limit the data for the searching user using a simple join on what is visible to them; with Lucene, however, we will not be able to do this.
The idea we have come up with is to tag news stories in the index with a collection of UserIds that are allowed to view that news story. I am afraid I am very new to Lucene and I cannot work out the best way to do this. We are adding the index for a news story like so:
document.Add(new Field("Title",
    news.Title,
    Field.Store.YES,
    Field.Index.ANALYZED,
    Field.TermVector.NO));
document.Add(new Field("Content",
    news.Content,
    Field.Store.YES,
    Field.Index.ANALYZED,
    Field.TermVector.NO));
However, if we have a collection of userIds defined as
IEnumerable<int>
how could we add these to the news story index and then search on them effectively for a given user ID? Additionally, what would the performance hit be if we are adding hundreds or thousands of UserIds against a Lucene document? Is there a better way to go than down this road, as this might be a terrible idea (it probably is)?
I also ran into that problem while migrating to Azure and ended up with that same permissions model. Since your userIds are integers and won't have special characters, you can rely on many of the Lucene(.net) analyzers, like StandardAnalyzer and WhitespaceAnalyzer, to split a list of IDs into terms as long as you input a string. Just separate each ID with a space or comma, depending on what the Analyzer will split on.
You should be able to do something simple like this to index the IDs...
IEnumerable<int> userIds = new int[] { 123, 456, 789 };
document.Add(new Field("AllowedUserIDs",
String.Join(" ", userIds),
Field.Store.NO,
Field.Index.ANALYZED_NO_NORMS));
Then just make sure to query with a TermQuery to match whole terms (IDs). Something like...
int currentUserID = 123;
string queryString = "airplane";

// The two content clauses go in a sub-query; otherwise, with a MUST
// clause present, Lucene treats SHOULD clauses as optional and every
// permitted document would match regardless of the search string.
BooleanQuery contentQuery = new BooleanQuery();
contentQuery.Add(new TermQuery(new Term("Title", queryString)), Occur.SHOULD);
contentQuery.Add(new TermQuery(new Term("Content", queryString)), Occur.SHOULD);

BooleanQuery query = new BooleanQuery();
query.Add(new TermQuery(new Term("AllowedUserIDs", currentUserID.ToString())), Occur.MUST);
query.Add(contentQuery, Occur.MUST);
I can't speak very specifically to the performance concerns but we have a few hundred IDs in our lists and it hasn't seemed to impact the query times since we added it. Really it's no different than searching on a few hundred or few thousand word news article.
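A language-neutral sketch of why the approach works: the ID list is stored as one whitespace-separated string, the analyzer splits it into terms, and a term query matches a whole term exactly (the `analyze` helper here is a stand-in for Lucene's WhitespaceAnalyzer, not a real API):

```javascript
// Simulate whitespace analysis plus exact term matching.
function analyze(text) { return text.split(/\s+/); }
function termMatches(fieldText, term) {
  return analyze(fieldText).indexOf(term) !== -1;
}

var allowed = [123, 456, 789].join(" "); // "123 456 789"
var canSee = termMatches(allowed, "123");   // whole-term match
var cannotSee = termMatches(allowed, "12"); // "12" is not a term
```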

Get users rank from couchDB

I'm trying to get the users rank from a couchDB database. The issue I'm having is I have multiple users and multiple games. I want to be able to pass 2 keys
The app id
The users score
I would like to then see how many records have the same app id and a lower score than the one I passed. This would return the user's current rank. This is my document structure:
{
"_id": "c68d16e1d8ba65accf97230dfbf7c2cb",
"_rev": "114-2aea3eef75c73e1079ed9c8d945723e1",
"credits": 2125,
"appName": "someApp"
}
I've tried setting views up, but the multiple keys are really confusing me. This is what I've tried, but it hasn't worked.
VIEW
"getrank": {
"map": "function(doc) { emit([doc.appName, doc.credits],{credits:doc.credits}) }"
}
URL CALLS I'VE TRIED
/players/_design/views/_view/getrank?key=["someApp","2000"]&startkey=["credits",2000]
/players/_design/views/_view/getrank?key=someApp"&startkey=["credits",2000]
I would like to then see how many records have the same app id and a lower score than the one I passed.
If I understand your question correctly, your view looks good. Maybe instead of emitting an object you can just emit doc.credits, or simply null and query with &include_docs=true.
Anyway, what you need to do is query over a range; startkey and endkey should work.
_view/getrank?startkey=["someApp",minima]&endkey=["someApp",maxima]
What this query does is give you the records for someApp between minima and maxima. Now we need to build on this.
lower score then the one I passed.
First we need to query in descending order. The only interesting thing here is that the roles of the keys reverse:
_view/getrank?startkey=["someApp",maxima]&endkey=["someApp",minima]&descending=true
Now suppose you want everything lower than 9000. Here is the final query that will do the trick:
_view/getrank?startkey=["someApp",9000]&endkey=["someApp"]&descending=true
This gives you all the scores for someApp less than 9000 (a key of just ["someApp"] collates before every ["someApp", score] key, so it works as the lower bound).
I have not actually run these queries but this should give you something to work with.
If you need all the records over a range you need range queries.
Range queries are done with startkey and endkey.
They are reversed when descending=true.
Hope this helps.
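One further option, sketched below under assumptions (the design-doc layout mirrors the one in the question): adding CouchDB's built-in `_count` reduce lets the server return the rank directly as a row count, instead of shipping all the lower-scored rows back.

```javascript
// View definition with a _count reduce; querying the range with
// reduce=true returns the number of rows, i.e. the rank.
var designDoc = {
  views: {
    getrank: {
      map: "function(doc) { emit([doc.appName, doc.credits], null); }",
      reduce: "_count"
    }
  }
};
// Query (hypothetical DB/design-doc names, matching the question):
//   /players/_design/views/_view/getrank
//     ?startkey=["someApp"]&endkey=["someApp",2000]
//     &inclusive_end=false&reduce=true
// The single reduced value is the number of documents for "someApp"
// with credits below 2000 - the user's rank.
```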

Passing parameters to a couchbase view

I'm looking to search for a particular JSON document in a bucket and I don't know its document ID, all I know is the value of one of the sub-keys. I've looked through the API documentation but still confused when it comes to my particular use case:
In mongo I can do a dynamic query like:
bucket.get({ "name" : "some-arbritrary-name-here" })
With couchbase I'm under the impression that you need to create an index (for example on the name property) and use startKey / endKey but this feels wrong - could you still end up with multiple documents being returned? Would be nice to be able to pass a parameter to the view that an exact match could be performed on. Also how would we handle multi-dimensional searches? i.e. name and category.
I'd like to do as much of the filtering as possible on the couchbase instance and ideally narrow it down to one record rather than having to filter when it comes back to the App Tier. Something like passing a dynamic value to the mapping function and only emitting documents that match.
I know you can use LINQ with couchbase to filter but if I've read the docs correctly this filtering is still done client-side but at least if we could narrow down the returned dataset to a sensible subset, client-side filtering wouldn't be such a big deal.
Cheers
So you are correct on one point: you need to create a view (an index, indeed) to be able to query on the content of the JSON document.
So in your case you have to create a view with this kind of code:
function (doc, meta) {
  if (doc.type == "yourtype") { // just a good practice to type the doc
    emit(doc.name);
  }
}
So this will create an index, distributed across all the nodes of your cluster, that you can now use in your application. You can point to a specific value using the "key" parameter.
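For the multi-dimensional case you asked about (name and category), the view can emit a composite key, which "key" can then match exactly on both fields. A standalone sketch, with emit() stubbed and made-up documents:

```javascript
// Simulate a Couchbase view with a composite [category, name] key.
var index = [];
function emit(key, value) { index.push({ key: key, id: value }); }

function map(doc) {
  if (doc.type === "yourtype") {
    emit([doc.category, doc.name], doc.id);
  }
}

[
  { id: "d1", type: "yourtype", category: "news", name: "a" },
  { id: "d2", type: "yourtype", category: "blog", name: "a" }
].forEach(map);

// Simulates ?key=["news","a"] - an exact match on both dimensions,
// so at most the documents sharing that exact pair come back.
var hits = index.filter(function (r) {
  return r.key[0] === "news" && r.key[1] === "a";
});
```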

How to realize complex search filters in couchdb? Should I avoid temporary views?

I want to administrate my User entities in a grid. I want to sort them, and I want a search filter for each column.
My dynamic generated temporary view works fine:
function(doc) {
  if (doc.type === 'User' &&
      // Dynamic filters: WHERE firstName LIKE '%jim%' AND lastName LIKE '%knopf%'
      (doc.firstName.match(/.*?jim.*?/i) &&
       doc.lastName.match(/.*?knopf.*?/i))) {
    // Dynamic sort
    emit(doc.lastName, doc);
  }
}
BUT everywhere it is written that you have to AVOID temporary views. Is there a better way? Should I save these searches on demand at runtime?
Thanks
You should definitely not use temporary views, as they have to be recomputed every single time they are queried (which is a very "expensive" process). A stored view is perfect when you know the fields you are searching ahead of time (it builds the index one time, only making incremental changes after that).
However, you won't be able to get "contains" searches. (you can get exact matches and "starts with" matches, but that's not what your example shows) If you need ad-hoc querying, you should seriously consider couchdb-lucene.
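To illustrate the stored-view alternative: a permanent view indexed once, queried with a "starts with" range (the high Unicode character `\ufff0` in the endkey is the usual CouchDB trick for prefix ranges; the design-doc and view names here are assumptions):

```javascript
// Stored design document: index users by lowercased last name.
var designDoc = {
  views: {
    byLastName: {
      map: "function(doc) { if (doc.type === 'User') emit(doc.lastName.toLowerCase(), null); }"
    }
  }
};
// Prefix query (matches "knopf", "knopfler", ...):
//   /db/_design/users/_view/byLastName?startkey="knopf"&endkey="knopf\ufff0"
// This covers "starts with" filtering; for true "contains" searches
// you still need something like couchdb-lucene, as noted above.
```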
